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ABSTRACT 

Experimental  response  functions  are  often  approximated  by  simple 
empirical  functions  such  as  polynomials.  Several  methods  for  modeling  such 
responses  which  take  into  account  this  approximate  nature  are  described  and 
are  shown  to  be  essentially  equivalent.  The  models  all  involve  a  Bayesian 
analysis  which  reflects  prior  experimental  belief  about  the  ability  of  the 
empirical  approximation  to  represent  the  true  respom  is  function.  The  models 
are  also  related  to  Kalman  filters.  Implications  of  the  models  for 
statistical  inference  are  examined  with  particular  attention  to  estimating  the 
response  function.  Numerical  examples  help  illustrate  the  models.  A  general 
predictive  check  is  developed  to  examine  the  consistency  of  the  model  with  the 
observed  data.  ^ 
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SIGNIFICANCE  AND  EXPLANATION 


Scientists  and  engineers  often  wish  to  predict  the  value  of  a  response 
variable  as  a  function  of  one  or  more  input  variables.  When  the  form  of  the 
response  function  is  unknown  or  is  very  complicated,  it  is  common  to 
approximate  the  true  response  function  by  an  empirical  graduating  function 
such  as  a  polynomial  whose  coefficients  are  estimated  from  observed  data. 

Several  statistical  models  have  been  proposed  recently  which  reflect  the 
approximate  nature  of  empirical  graduating  functions.  The  models  all  follow  a 
Bayesian  approach  which  incorporates  the  experimenter's  prior  beliefs  as  to 
the  likely  (in) adequacy  of  the  graduating  function  to  represent  the  true 
response  function.  The  resulting  prediction  equation  has  two  components:  an 
estimated  graduating  function  and  a  second  function  which  may  be  interpreted 
as  the  estimated  bias  induced  by  the  particular  choice  of  graduating 
function.  The  inclusion  of  the  bias  component  typically  allows  the  prediction 
equation  to  follow  the  observed  data  more  closely  than  the  graduating  function 
alone.  The  extent  to  which  the  prediction  equation  will  deviate  from  the 
graduating  function  depends  largely  on  the  experimenter's  prior  beliefs:  if 
these  express  perfect  confidence  in  the  ability  of  the  graduating  function  to 
model  the  response,  then  the  bias  will  be  estimated  to  be  0  and  the  prediction 
equation  will  contain  only  a  graduating  function;  on  the  other  hand,  if  the 
graduating  function  is  thought  to  be  seriously  inadequate,  the  prediction 
equation  will  approximately  interpolate  the  observed  data  points. 

An  example  is  given  in  which  yield  of  a  chemical  process  is  thought  to  be 
roughly  a  linear  function  of  the  reaction  temperature.  The  Bayesian  methods 
produce  prediction  equations  which  show  an  overall  linear  increase  of  yield 
with  temperature,  but  with  local  deviations  about  the  trend  line  to  more 
closely  follow  the  observed  experimental  data. 

The  responsibility  for  the  wording  and  views  expressed  in  this  descriptive 
summary  lies  with  MRC,  and  not  with  the  author  of  this  report. 


BAYESIAN  MODELS  FOR  RESPONSE  SURFACES 
OF  UNCERTAIN  FUNCTIONAL  FORM 

David  M.  Steinberg 

1 .  Introduction 

Many  scientific  investigations  are  designed  to  explore  the  relationship 
between  a  response  variable  Y  and  a  set  of  input,  or  explanatory  variables, 

, C2' • • • ' 5^*  The  inputs  may  be  continuous  variables,  such  as  dosage  or 
time,  or  categorical  variables,  such  as  sex  or  batch  number.  Information  on 
how  the  response  is  related  to  the  inputs  is  obtained  by  conducting  an 
experiment:  n  combinations  of  the  inputs  are  specified  and  at  each  of  these 
combinations  an  experiment  is  conducted  and  the  resulting  value  of  the 
response  is  observed. 

Sometimes  the  physical  nature  of  the  problem  suggests  a  specific 
functional  form  linking  the  response  to  the  explanatory  variables.  However, 
in  many  investigations  the  functional  nature  of  the  response  is  either  unknown, 
or  is  too  complicated  to  provide  a  useful  representation.  A  common  strategy 
in  such  situations  is  to  approximate  the  true,  but  unknown,  response  function 
by  a  simple  graduating  function,  typically  a  low  degree  polynomial.  For 
example,  one  might  employ  the  first  order  approximation: 

(1.1a)  Y(C)  -  80  +  18^, 

where  the  q  variables  X^,...,Xg  are  currently  thought  to  be  appropriate 

functions  of  the  input  variables  £.  and  8„,8,  ,...,8  are  parameters 

Ik  0  1  q 

which  must  be  estimated.  This  type  of  approximation  is  most  common  when  the 
input  variables  are  continuous  but  is  also  applicable  for  categorical  inputs. 

If  a  first  order  approximation  is  judged  to  be  an  inadequate  representation  of 
the  response  function,  one  might  use  instead  the  second  order  approximation: 
(1.1b)  Y(C)  -  eQ  +  18^+  EEBijXiX.. 


Sponsored  by  the  United  States  Army  under  Contract  No.  DAAG29-80-C-0041. 


When  these  equations  are  used  to  approximate  the  relationship  between  the 
response  and  the  inputs,  we  can  represent  the  experimental  results  by  a  linear 
statistical  models 

(1.2)  T  -  X0  +  e, 

where  Y  denotes  the  nxl  vector  of  observed  responses,  X  is  an  nxp  matrix  of 
regressors  whose  columns  correspond  to  appropriate  powers  of  the  q  functions 
of  the  input  variables,  X,j,...,X  ,  0  is  a  vector  of  parameters  which  must  be 

estimated,  and  e  denotes  the  experimental  error  associated  with  the 
observations.  The  statistical  techniques  of  response  surface  methodology  can 
now  be  used  to  estimate  the  parameters  and  to  analyze  the  behavior  of  the 
response  (see  Box  and  Wilson  [1951],  Box  [1954],  Box  and  Youle  [1955],  Myers 
[1976]  ). 

Because  (1.2)  is  only  an  approximate  model  based  on  an  approximate  form 
for  the  response  function,  it  is  natural  to  wonder  what  the  effect  might  be  of 
analyzing  the  experiment  as  though  the  approximating  model  were  in  fact  the 
true  response  function.  Box  and  Draper  [1959]  found  that  criteria  for 
experimental  design  are  affected  dramatically  when  the  approximate  nature  of 
the  model  is  taken  into  account. 

Recently,  several  authors  have  proposed  generalizations  of  (1.2)  which 
are  designed  to  model  scientific  uncertainty  about  the  form  of  the  true 
response  function.  Smith  [1973]  suggests  a  three  tiered  Bayesian  model 
following  the  general  structure  described  by  Lindley  and  Smith  [1972].  Blight 
and  Ott  [1975]  expand  (1.2)  by  adding  an  extra  term  to  reflect  the  difference 
between  the  approximation  and  the  true  model.  O'Hagan  [1978]  proposes  a 
"localized  regression"  model,  in  which  the  structure  of  (1.2)  is  maintained, 
but  the  parameter  values  are  allowed  to  vary  throughout  the  design  space, 
subject  to  prior  assumptions  about  their  joint  distribution.  Wahba  [1978] 
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advocates  the  use  of  approximating  spline  functions,  and  demonstrates  how 
these  generalize  (1.2)  and  also  how  they  can  be  viewed  as  Bayesian  solutions 
to  the  estimation  problem.  Sacks  and  Ylvisaker  [1978]  suggest  a  model 
similar  in  form  to  that  of  Blight  and  ott,  but  employ  a  frequentist,  rather 
than  a  Bayesian,  estimation  procedure. 

This  peper  will  discuss  the  first  four  approaches,  all  of  which  involve 
Bayesian  analyses.  The  next  section  will  define  the  four  models  and  will 
demonstrate  how  they  are  related  to  one  another.  Sections  3  and  4  will 
describe  how  the  models  can  be  used  to  draw  inferences  about  the  relationship 
between  the  response  variable  and  the  inputs.  Section  5  will  consider  the 
traditional  statistical  test  for  lack  of  fit  (i.e.,  model  inadequacy)  in  the 
context  of  the  Bayesian  models.  Section  6  will  demonstrate  some  similarities 
and  differences  between  these  models  and  the  Kalman  filter.  Some  examples 
will  be  presented  in  Section  7  and  concluding  remarks  are  reserved  for  the 
final  section. 


2.  Models  for  Unknown  Response  Functions 

Smith  [1973]  proposes  a  hierarchical  Bayesian  linear  model  to  represent 
the  relationship  between  a  response  vector  T  and  a  matrix  X  of  associated 
regressor  variables.  The  hierarchical  structure  consists  of  three  tiers  and 
provides  the  mechanism  for  building  uncertainty  about  the  response  function 
into  the  statistical  model.  The  model  is  a  special  case  of  the  general  three 
tiered  Bayesian  model  analyzed  by  Lindley  and  Smith  [1972]  so  that  all  of 
their  results  may  be  applied  here. 

The  first  tier  of  Smith's  model  singly  states  that  the  observation  vector 

T  follows  a  multivariate  normal  distribution  with  mean  vector  0^  and 

2  2 
variance  matrix  o  times  the  identity  matrix,  where  o  indicates  the 

magnitude  of  experimental  error: 

(2.1a)  T/©1  ~  Nl^.o2!). 

The  second  tier  invokes  the  linear  model  structure  by  asserting  that  the 
vector  of  expected  values,  0^,  follows  a  multivariate  normal  distribution 
with  mean  vector  X02>  where  @2  is  a  vector  of  regression  coefficients  and 
corresponds  directly  to  0  in  equation  (1.2): 

(2.1b)  0^02  ~  N(X02,¥). 

Smith  examines  in  detail  only  the  case  where  X$2  represents  polynomial 
regression  with  a  single  regressor,  but  the  generalization  to  arbitrary  linear 
models  is  straightforward.  The  variance  matrix  V  indicates  the  experimenter's 
a  priori  confidence  in  the  adequacy  of  the  linear  model.  Note  that  if  the 
elements  of  ▼  are  all  quite  small,  then  the  model  is  claiming  that  the 
expected  value  vector,  0^,  follows  the  linear  model  X@2  very  closely;  i.e., 
the  linear  model  is  assumed  to  be  a  very  good  representation  of  the  true 
response  function.  In  fact,  the  limiting  case  of  Smith's  model  as  V  converges 
to  a  0  matrix  yields  precisely  the  same  analysis  that  results  from  treating 
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(1.2)  as  an  exact  model  for  the  experiment.  On  the  other  hand,  if  the 
elements  of  V  are  rather  large,  this  reflects  prior  belief  that  the  true 
response  may  deviate  considerably  from  the  linear  model,  even  though  it  may  be 
our  best  current  guess  for  the  response  function.  Just  what  constitutes 
‘‘large''  or  "small”  entries  in  V  will  be  left  vague  for  now;  however,  a 
reparameterization  which  makes  these  terms  more  precise  will  be  described  in 
section  3. 

The  final  tier  of  the  model  assigns  a  prior  distribution  to  the 
regression  parameters: 

(2.ic)  e2  ~  n(#0,v1). 

A  diffuse  prior  is  often  deemed  appropriate  for  the  regression  parameters  and 
this  can  be  achieved  by  considering  limiting  forms  as  V1 1  converges  to  0. 

Blight  and  Ott  [1975]  propose  a  model  which  represents  the  response  as  a 
sum  of  three  components: 

(2.2)  Response  »  Low  degree  polynomial  approximation 

+  deterministic  error  (bias) 

+  random  (experimental)  error. 

The  first  and  last  terms  are  identical  to  the  two  terms  in  (1.2);  what 
distinguishes  this  model  from  (1.2)  is  the  second  term,  which  is  an  explicit 
statement  of  the  approximate  nature  of  the  polynomial.  Actually,  this  type  of 
statistical  representation  is  well  established:  it  is  common  practice  to 
analyze  estimators  post-hoc  by  taking  the  expected  value  of  the  squared 
difference  between  the  estimator  and  the  quantity  to  be  estimated  and  then  to 
decompose  this  expected  mean  square  error  into  variance  and  bias  components. 
Blight  and  Ott  merely  suggest  that  bias,  as  well  as  variance,  be  incorporated 
into  the  original  statistical  model. 
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Mathematically,  Blight  and  Ott's  model  can  be  written: 

(2.3)  T  -  X0  +  f|  +  e. 

The  three  terms  on  the  right  hand  side  of  (2.3)  correspond  to  the  respective 
components  of  (2.2).  Blight  and  Ott,  like  Smith,  reserve  close  scrutiny  for 
the  special  case  where  only  one  regressor  variable  is  used  in  the  polynomial 
approximation,  but  there  is  no  difficulty  in  generalizing  to  arbitrary  linear 
models.  Observe  that  (2.3)  is  identical  to  (1.2)  but  for  the  addition  of  the 
middle  term,  n,  and  the  assumption  that  this  term  permits  an  exact 
representation  of  T,  so  that  an  equals  sign  is  now  justified. 

Blight  and  Ott  complete  their  model  specification  by  making  the  following 
distributional  assumptions : 

(2.4a)  0  -  N(0o,fl) 

(2.4b)  n  -  N(0,li-1 ) 

(2.4c)  e  -  N(0,o2I). 

In  addition,  it  is  assumed  that  0,  n  and  C  are  distributed  independently. 

A  simple  rationale  underlies  the  distributional  assumptions.  Equation 
(2.4a)  provides  the  prior  distribution  for  the  regression  parameters  and  is 
directly  analogous  to  (2.1c)  in  Smith's  model.  Again,  a  diffuse  prior 
corresponds  to  considering  limiting  forms  as  Q  1  tends  to  a  0  matrix.  The 
assumptions  on  e  are  identical  to  those  in  the  standard  linear  model,  that 
the  experimental  errors  are  independent  and  normally  distributed  with  zero 
mean  and  common  variance  a2. 

The  distribution  of  the  bias  term  given  in  (2.4c)  is  justified  by 
appealing  to  prior  belief.  Recall  that  the  bias  term  represents  that  part  of 
the  response  function  not  captured  by  the  approximating  polynomial.  Since  the 
approximating  polynomial  typically  represents  the  best  current  guess  as  to  how 
the  response  depends  on  the  inputs,  it  is  reasonable  to  assign  the  bias  a 
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prior  mean  of  0.  The  variance  matrix  of  the  bias  suggests  the  possible 
severity  of  the  bias}  i.e.,  the  prior  adequacy  assumed  for  the  model.  The 
diagonal  elements  of  I>  1  can  be  interpreted  as  reflecting  the  magnitude  of 
the  bias  at  the  respective  design  points.  The  off-diagonal  elements  reflect 
prior  assumptions  about  how  similar  the  bias  is  likely  to  be  at  corresponding 
pairs  of  design  points.  This  is  closely  related  to  prior  convictions  about 
the  smoothness  of  the  response  function,  since  a  response  function  which  is 
smooth  will  have  similar  biases  at  proximate  design  points. 

Both  Smith  and  Blight  and  Ott  propose  to  account  for  the  approximate 
nature  of  the  linear  model  (1.2)  by  adding  an  extra  component.  Smith  suggests 
a  hierarchical  model,  which  assumes  that  the  expected  values  of  the 
observations  need  not  exactly  follow  the  linear  model.  Blight  and  Ott  add  a 
bias  term  which  explicitly  represents  the  difference  between  the  true  response 
function  and  the  linear  model.  In  both  cases,  a  covariance  matrix  egresses 
the  experimenter's  prior  belief  about  the  degree  to  which  the  approximating 
model  may  be  inadequate.  It  is  clear  that  the  two  approaches  share  a  similar 
spirit.  He  now  show  just  how  similar  they  are. 

Theorem  2. 1  The  Smith  model  and  the  Blight-Ott  model  are  mathematically 
equivalent. 

Proof}  The  proof  is  very  single  and  relies  on  a  trivial  re-writing  of 
Smith's  model.  He  will  simply  write  each  of  the  first  two  stages  in  Smith's 
model  as  the  sum  of  a  deterministic  term  (the  expected  value)  plus  a  random 
term  with  an  appropriate  covariance  matrix.  (In  fact,  this  is  one  of  the 
methods  employed  by  Lindley  and  Smith  [1972]  in  deriving  some  of  the 
properties  of  their  more  general  hierarchical  model.)  Thus,  we  rewrite 
equation  (2.1a)  as: 

(2.5a)  T  -  +  «,  whr  >  *  -  V'  .  j2i) . 


7 


I 


Similarly,  we  rewrite  equation  (2.1b)  as: 

(2.5b)  01  =  X02  +  f»,  where  H  ~  N(0,V). 

Now,  substituting  (2.5b)  into  (2.5a)  gives: 

(2.5c)  T  =  xfl2  +  n  +  e, 

where  the  distributions  of  it  and  e  are  given  above  and  the  distribution  of 
02  is  given  in  (2.1c),  and  the  three  terms  are  independent.  This  is 
precisely  the  model  suggested  by  Blight  and  Ott,  with  #2  in  place  of  0  and  V 
corresponding  to  L  \  This  completes  the  proof. 

The  mathematical  equivalence  proved  above  is  interesting  in  light  of  the 
somewhat  different  interpretations  suggested  for  the  two  models.  The  model 
inadequacy  approach  advocated  by  Smith  and  the  bias  approach  used  by  Blight 
and  Ott  would  seem  to  be  just  two  sides  of  the  same  coin. 

O'Hagan  [1978]  suggests  a  different  way  to  modify  (1.2)  to  reflect 
uncertainty  as  to  the  form  of  the  response  function.  He  argues  that,  while 
(1.2)  may  be  adequate  to  describe  the  response  function  in  the  immediate 
neigl  _  „rhood  of  any  particular  point  x  =  (X  ^  (€),...  ,X(Q) ,  it  is  unlikely 
to  be  valid  over  the  entire  range  of  inputs  which  might  be  used.  This  leads 
him  to  generalize  (1.2)  by  allowing  the  parameter  vector  0  to  be  a  function 
of  x.  The  manner  in  which  0  varies  with  x  is  not  specified  in  terms  of  an 
exact  functional  form;  rather,  a  prior  probability  distribution  is  used  to 
describe  the  experimenter's  beliefs  as  to  how  the  parameters  may  change  from 
one  point  to  another.  O'Hagan  calls  this  the  "localized  regression  model". 

O'Hagan  formally  defines  the  localized  regression  model  by  specifying  the 
appropriate  distributional  assumptions  for  each  point  x  in  the  design  space. 
Denoting  by  Yx  an  observation  at  the  point  x,  he  assumes  that: 

(2.6a)  Yx/0(x)  -  N(f(x)'0(x),O2), 

where  £(x)  is  a  vector  whose  elements  are  the  appropriate  regressor  variables 
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evaluated  at  x  and  a  is  the  experimental  error  variance.  Further,  he 
assumes  that: 

(2.6b)  ft(x)/bQ  ~  N(b0,B0>. 

We  can  interpret  bQ  as  the  parameters  of  a  global  regression  function  about 
which  there  is  local  variation:  Bq  is  a  corresponding  covariance  matrix  whose 
entries  indicate  how  much  the  regression  parameters  at  any  particular  point 
may  vary  about  those  for  the  global  regression  function.  When  there  is  no 
prior  information  to  suggest  a  specific  global  regression  function.  O' Hagan 
suggests  using  a  vague  prior  distribution  for  b^.  This  is  accomplished  by 
assuming  that: 

(2.6c)  bQ  -  N(b*,kB*), 

and  considering  limiting  forms  as  k  ♦  ».  Finally,  he  assumes  that  the 
parameter  values  for  any  two  points  and  Xj  are  stochastically  related: 

(2.6d>  >{[»(«,)  -  b()][|l(x2)  -  V*/b0}  »  V(x1,x2). 

The  matrix  ▼  in  (2.6d)  plays  a  similar  role  to  the  matrix  I.  1  in  the 
Blight-Ott  model.  O'Hagan  uses  this  matrix  to  reflect  prior  belief  about  how 
much  the  parameters  (less  the  global  parameter  values)  may  vary  from  one  point 
to  another,  while  Blight  and  Ott  wish  to  model  how  much  the  response  function 
itself  (less  the  polynomial  approximation)  may  vary.  In  general,  the  entries 
of  V  are  related  to  prior  convictions  about  the  smoothness  of  the  response 
function:  assuming  that  the  parameter  values  at  two  design  points  are  highly 
correlated  is  tantamount  to  assuming  that  a  single  approximating  polynomial  is 
adequate  to  describe  the  response  function  throughout  the  range  between 
them.  This  idea  can  be  used  to  suggest  specific  forms  for  V.  For  example,  in 
the  case  of  polynomial  regression  with  a  single  regressor  variable,  O'Hagan 
suggests  using:  Vfx^x^  ■  pUx^  -  x2|)BQ,  where  p(d)  is  a  monotone 

decreasing  function  of  d  and  p(0)»1.  For  this  choice,  0(x)  is  a  second- 
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order,  stationary  stochastic  process  on  the  real  line  and  the  rate  at  which 
p(d)  decreases  reflects  prior  assumptions  about  the  smoothness  of  the 
response  function.  The  slower  the  rate  of  decrease,  the  more  highly 
correlated  will  be  the  parameter  values  at  similar  points,  and  hence  the 
greater  the  degree  of  smoothness  assumed  for  the  response  function. 

From  the  above  comments,  it  is  clear  that  O' Hagan's  model  shares  some 
common  ground  with  Blight  and  ott's  model.  In  fact,  O'Hagan  observes  that  the 
Blight-Ott  model  "shows  many  similarities"  to  his  own.  However,  concentrating 
on  the  specific  covariance  function  analyzed  in  detail  by  Blight  and  Ott,  he 
continues  that  "their  model  can  be  regarded  as  a  special  case  of  ours  in  that 
a  localized  regression  term  (locally  constant)  is  added  to  a  global  term 
(polynomial).  Their  analysis  relies  on  a  special  covariance  structure  for 
[ n]  and. ..their  estimation  of  the  global  term  takes  no  account  of  the  average 
effect  of  the  localized  disturbance  over  the  region  of  interest."  By 
considering  Blight  and  Ott's  model  in  the  more  general  form  described  in  (2.3) 
and  (2.4),  we  now  show  that  it  is  actually  equivalent  to  O'Hagan's  model.  Of 
course,  by  Theorem  2.1,  this  is  also  true  of  Smith's  model. 


Theorem  2.2;  Under  the  model  specification  of  (2.6a-d),  the  observed  data 


vector  T  is  described  by  the  model  (2.3)-(2.4),  with  bp  in  place  of  0  and 
1.^  »  f(*i)'V(*i,xJf(z^). 

Proof :  The  proof  is  very  similar  to  that  of  Theorem  2.1.  We  begin  by 


rewriting  (2.6a)  as: 

(2.7a)  *  0(*)'K*)  +  Ex*  where  ~N(0,o2). 

Now  rewrite  (2.6b)  as: 


(2.7b)  0(x)  -  bQ  +  C(x),  where  £(x)  «■  N(0,BQ). 
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Substituting  (2.7b)  into  (2.7a): 

(2.7c)  Y  =  £(x)'b„  +  £(x)'C(x)  +  C 
x  O  x 

=  f(x)'b.  +  h  +  £  ,  where  n  =  f(x)'C(x). 

0  x  x  x 

Thus  the  observation  vector  Y  can  be  written: 

(2.7d)  Y  =  XbQ  +  n  +  e, 

precisely  the  form  of  (2.3).  All  that  remains  to  complete  the  proof  is  to 

show  that  n  has  the  distribution  claimed  in  the  theorem: 

(2.7e)  K { n}  =  0,  because  e{c(x)}  =  0  for  all  x. 

( 2. 7f )  e{[£(x. )'t(x. )] [£(x.)’C(x.)]  '} 

1  i  i  3  3 

=  £(  x.  ) '  e{c(x.  )C(x.  )'  }  £(  x.  ) 
i  1  i  3  1  3 

=  £(x1)'V(xi,x.  )f(x.  ). 

The  last  line  follows  from  combining  assumption  (2.6d)  of  the  model  with  the 
definition  of  £(x)  in  (2.7b). 

The  spline  function  approach,  on  the  surface,  appears  quite  unrelated  to 
the  models  described  above.  However,  results  of  Wahba  [1978]  show  that  it  is 
essentially  an  equivalent  procedure  when  the  regression  coefficients  are 
assigned  a  diffuse  prior. 

Generalized  smoothing  splines  for  the  statistical  problem  posed  here  are 
derived  as  solutions  to  a  problem  in  functional  approximation.  The  solution 


in  the  general  case  exploits  the  structure  of  reproducing  kernel  Hilbert 
spaces  (r.k.h.s.)  (see  Aronszajn  [1950]  for  the  general  theory  of  r.k.h.s.). 


First,  denote  by  the  monomials  which  constitute  the  approximating 

polynomial.  Now  let  H^  be  a  r.k.h.s.  of  functions  defined  on  the  design  space 
which  contains  the  and  has  reproducing  kernel  K(x^,X2>.  It  can  be 

shown  that  Hj^  has  a  representation  as  the  direct  sum  of  span  {^}  and  a 
r.k.h.s.  Hg,  which  has  reproducing  kernel  Q(x^,X2).  Let  Pg  be  the  orthogonal 
projection  operator  from  H^  onto  Hg.  Then  the  generalized  smoothing  spline 
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g  .  is  defined  as  the  solution  to  the  problems  find  g  £  H,,  to  minimize 
n  a  A 

-1  n  ■>  1 
(2.8)  n  l  [g(x  )  -  Y  ]  +  A,Png,v 

i=1  C 

where  the  summation  is  over  the  n  observed  data  points  and  the  latter  term  is 
the  squared  norm  (in  HK)  of  the  projection  of  g  onto  Hq  times  a  smoothing 
parameter  X. 

Much  of  the  work  on  smoothing  splines  has  focused  on  the  case  where  the 

design  space  is  one  dimensional,  \  j=1,...,p,  and 

P  (g)  =  d^/dx15.  In  this  case,  it  is  well  known  that  g  .  is  a  polynomial 
Q  n#  a 

spline  of  degree  2P-1  and  is  uniquely  determined  provided  the  data  cannot  be 

exactly  interpolated  by  the  approximating  polynomial  (see  Wahba  [1978]).  A 

common  choice  for  p  has  been  p=2,  in  which  case  lP^(g)|2  *  /( g^2*(x))2dx  and 

has  a  direct  interpretation  as  a  measure  of  the  smoothness  of  the  solution. 

The  choice  of  X  then  controls  the  tradeoff  between  how  smooth  the  solution 

will  be  and  how  clos'  ly  it  will  match  the  observed  data. 

The  solution  g  .  to  (2.8)  will  always  include  the  postulated 
n,  A 

approximating  polynomial,  since  Pg($J  =  j«1,...,p.  Thus,  the  polynomial 

can  be  included  in  the  solution  at  no  cost  to  the  second  term  in  (2.8),  with 
the  coefficients  chosen  to  minimize  the  sum  of  squares.  This  prompts  Wahba  to 
suggest  that  "spline  smoothing  is  an  appropriate  solution  to  the  problem 
arising  when  one  wants  to  fit  a  given  set  of  regression  functions  to  the  data 
but  one  also  wants  to  'hedge'  against  model  errors". 

Wahba  [1978]  proves  the  following  theorem  which  relates  spline  smoothing 
to  Bayesian  estimation  of  a  stochastic  process. 

Theorem  2.3:  Suppose  the  true  response  function  is  g(x),  so  that  the  i'th 
data  point  is 


Y  =  q(  x  )  +  e 
i  9V  i  i 


where  C  *  (e  ,....,€  )'  ~  N(0,o^I).  bet  the  prior  distribution  of  g(x)  be 
1  n 
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the  same  as  that  of  the  stochastic  process 

(2.9)  T  (x)  =  f  5.*.(x)  +  b1/2Z(x), 

*  j=1  3  3 

where  0  =  (0,,...,0  )'  N(0.,CD«  b>0  is  fixed  and  Z(x)  is  a  zero  mean 

Ip  0 

Gaussian  stochastic  process  with  E {z( x^  )Z( x^ ) }  =  Qfx^jX^).  Then  for  any 
fixed  point  x, 

gn,x(x)  =  fete  Eslg<*)/T=y}» 

2 

where  A=0  /nb  and  E^  denotes  expectation  with  respect  to  the  posterior 

distribution  of  g(x)  given  the  prior  (2.9).  Thus  the  smoothing  spline 

solution  g  .  is  the  limiting  posterior  expectation  of  the  response  function 
Il|  A 

given  (2.9)  when  the  prior  distribution  of  the  parameters  in  the  approximating 
polynomial  is  made  diffuse. 

The  characterization  of  spline  smoothing  in  Theorem  2.3  as  a  form  of 
Bayesian  estimation  suggests  a  similarity  with  the  previous  models.  We  prove 
this  in  the  following  theorem. 

Theorem  2.4;  Under  the  prior  specification  (2.9)  of  the  last  theorem,  the 
prior  distribution  of  the  data  vector  Y  is  given  by  the  Blight-Ott  model 
((2.3)  and  (2.4))  with  Q  =  Cl  and  with  L  V  ^  =  bQtx^.x^). 

Proof :  The  i'th  observation  is  ■  g(x^)  +  e^.  Then,  given  (2.9),  the 

prior  distribution  for  the  i'th  observation  is  the  same  as  the  distribution  of 


J/j*j(Xi 

3=1  J  J 


)  +  b1/2z(x.  )  +  e. 

i  i 


■  f  S-iW  +  \  +  v 

j=i  J  J  1 


The  full  data  vector  Y  thus  has  a  prior  distribution  identical  to  the 
distribution  of 

x0  +  n  +  e 

l 

where  X  is  an  nxp  matrix  with  X  =  $.(x, ),  0  =  ( 0, , . . . , 0  )  and 

i»J  3  i  1  P 

*1  =  ( , • • • , H  ) ' •  The  prior  distributions  of  0,  n,  and  C  are  easily  seen  to 
be  those  claimed  in  the  theorem. 
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The  essence  of  Theorems  2.3  and  2.4  Is  that  spline  estimation  can  be 
derived  as  a  Bayes  estimate  of  the  response  function,  and  the  prior 
specification  which  corresponds  to  spline  estimation  is  essentially  that 
proposed  by  Blight  and  Ott  (or,  equivalently,  by  Smith  or  O'Hagan).  However, 
two  qualifications  are  necessary.  First,  the  Bayesian  model  defined  by  (2.9) 
prescribes  the  prior  distribution  of  the  response  function  for  all  possible 
factor  settings.  This  is  also  true  of  O' Hagan's  localized  regression  model. 
The  Smith  model  of  (2.1)  and  the  Blight-Ott  model  of  (2.3)  and  (2.4)  are 
limited  to  the  n  observed  data  points.  However,  the  natural  extension  of 
their  models  to  draw  inferences  about  the  response  at  arbitrary  factor 
combinations  corresponds  precisely  to  the  priors  of  (2.6)  and  (2.9).  Thus 
these  "complete"  priors,  although  not  stated  directly  by  Smith  or  by  Blight 
and  Ott,  are  nonetheless  implicit  in  their  models.  This  will  be  examined  in 
detail  in  section  3. 

The  second  difference  between  the  model  formulations  concerns  the  matrix 
L  *.  In  the  Smith  and  Blight-Ott  models,  the  entries  of  this  matrix  reflect 
the  experimenter's  prior  belief  as  to  the  magnitude  of  bias  at  the  design 
points  and,  in  theory,  they  are  restricted  only  by  the  requirement  that 
I,  1  be  a  legitimate  covariance  matrix.  This  is  not  so  for  O'Hagan's  model  or 
for  spline  estimation.  The  form  of  this  matrix  for  O'Hagan's  model  was 
derived  in  Theorem  2.2  and  was  found  to  depend  on  the  form  of  the 
approximating  polynomial.  The  covariance  matrix  for  spline  estimation,  given 
in  Theorem  2.4,  derives  from  the  r.k.h.s.  structure  in  conjunction  with  the 
choice  of  the  approximating  polynomial.  The  problem  must  be  phrased  in  terms 
of  an  overall  r.k.h.s.  hr  which  in  turn  must  be  decomposable  into  the  direct 
sum  of  span  and  a  r.k.h.s.  Hg.  The  covariance  matrix  is  then  determined 

by  the  kernel  Qfm^Xg)  of  Hg.  This  restriction  may  result  in  sane  loss  of 
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generality*  In  actual  practice,  the  Bayesian  interpretation  of  L  1  as  a 
covariance  matrix  has  played  a  secondary  role  in  spline  fitting.  The  primary 
concern  has  been  to  define  an  appropriate  smoothness  penalty.  The  smoothness 
penalty  determines  the  appropriate  r.k.h.s.  and  L  '  is  then  obtained  as  the 
Grammian  matrix  of  appropriate  elements  of  the  r.k.h.s.  (see  Kimeldorf  and 
Wahba  [1971]  for  details).  This  is  a  valuable  reminder  that  the  basic 
motivation  for  using  smoothing  splines  derives  from  ideas  in  functional 
approximation,  not  from  its  characterization  as  a  Bayesian  technique,  although 
the  latter  has  been  emphasized  here. 
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3.  Inference  for  Future  Observations:  Prediction  and  Estimation 

This  section  will  discuss  how  statistical  inferences  may  be  drawn  from 
the  Bayesian  models  of  section  2.  The  analysis  will  focus  on  the  problem  of 
predicting  the  response  for  various  combinations  of  the  input  variables* 

This,  after  all,  is  the  ultimate  objective  of  the  response  surface  framework 
described  in  section  1.  However,  we  will  also  consider  questions  of  inference 
regarding  the  regression  coefficients  themselves,  in  particular  as  they  relate 
to  predicting  the  response. 

The  notation  used  will  be  that  of  the  Smith  model  (equation  2.1), 
although  both  the  Smith  model  and  the  Blight  and  Ott  models  will  prove  useful 
in  deriving  some  of  the  results.  Of  course,  they  will  be  equally  valid  for 
both  models,  as  well  as  for  O' Hagan's  model  and  for  the  spline  formulation 
when  a  diffuse  prior  is  used  for  the  regression  coefficients.  Most  of  the 
results  will  be  stated  as  theorems.  However,  the  proofs  of  the  theorems  will 
be  deferred  to  the  appendix  at  the  conclusion  of  this  report. 

Consider  first  the  problem  of  predicting  the  response  Y  when  the  input 
variables  are  fixed  at  , . . . , First,  let  us  convert  the  inputs  to 

the  more  meaningful  scale  of  X^,...X^.  We  will  then  derive  results  for 
predicting  Y(c)  indirectly,  in  terms  of  Y(x),  where  jt=(X1 (c) , . . . ,X^(c) ) . 

For  the  Bayesian  models  of  section  2,  a  natural  method  of  prediction  is 
to  extend  the  model  to  include  the  prediction  site,  as  well  as  the  design 
points,  and  then  to  find  the  conditional  expectation  of  Y  at  the  prediction 
site  given  the  observed  data.  Let  Y  denote  the  n  observed  experimental 
responses  and  let  Yx  denote  a  (as  yet  unobserved)  response  at  the  prediction 
site.  Similarly,  denote  by  0^  and  0^  the  respective  first  tier  expected 
values  of  Y  and  Yx.  Assume  that  the  observational  error  associated  with  Yx, 
like  those  in  the  experiment,  is  independent  and  normally  distributed  with 
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mean  0  and  variance  a  .  Denote  the  regressor  variables  for  Y  by  an  nxy 
matrix  X  and  denote  those  for  Yx  by  a  Ixp  vector  s'.  Denote  the  covariance 
matrix  of  the  second  tier  by 


where  the  partitioning  corresponds  to  that  of  and  0^.  The  entire  model  is 
now  specified  by: 

o.ia)  (r\Yx)/<e',ex)  -  n( <0*f ex>, o2in+1 ) 

o.ib)  <ei'0x>/°2  ~  N<<(xe2)\(*,e2)’)/v‘) 

(3.1c)  «2  -  N(aQ/V1). 

It  should  be  noted  that  this  is  precisely  the  model  that  would  be  implied  by 
the  "more  complete"  prior  specification  of  (2.9),  when  the  prior  for  the 
regression  coefficients,  02»  is  made  diffuse.  This  explains  the  comment  of 
the  previous  section  that  the  more  complete  prior  is  implicitly  assumed  in  the 
Smith  and  Blight-Ott  models. 

Theorem  3.1:  Under  the  model  specification  of  (3.1),  the  conditional 
expectation  of  Yx,  given  that  the  observed  data  are  Y=y,  is: 

(3.2)  e{y  j/*=y}  =  *’30  +  (▼’  +  z’V^X'  )(o2In  +  V+  XV^’TVy  -  XBQ). 

The  conditional  expectation  of  the  regression  parameters  is: 

(3.3)  E{«2/Y=y}  =  BQ  +  V1X,(rf2In  +  V  +  X^X’fVy  -  X0Q). 

The  results  of  Theorem  3.1,  although  straightforward,  are  not 

particularly  revealing;  equations  (3.2)  and  (3.3)  do  not  provide  much 
intuition.  The  following  corollaries  and  theorems  will  help  to  clarify  this 
situation.  We  begin  by  observing  how  (3.2)  and  (3.3)  are  related. 

Corollary  3.1.1:  (i)  The  prediction  for  Y^  has  a  natural  decomposition  as  the 

sum  of  an  approximating  polynomial  whose  coefficients  are  estimated  by 
E{®2/Y=*y}  and  a  second  term,  which  Blight  and  Ott  call  the  correction  for 
bias,  depending  on  v: 


17 


(3.4) 


*{V**y}  “  *'E{®2/**y}  +  T'(o2In  +  V  +  *r1*')“1(y  -  X^) 

-  *'x{e2/*-y}  +  xln^r-y}, 

whtrt  nx  denotes  the  bias  contribution  at  x,  following  the  notation  of  (2.3). 
(ii)  The  bias  term  in  (3.4)  can  be  simply  expressed  in  terms  of  the  residuals 
at  the  design  points  when  the  approximating  polynomial  alone  is  fit.  Let  us 

*  A  r  I 

denote  that  vector  of  predictions  by  T^p  »  that  is,  XE {82/T“y }.  In  a 

A  A 

similar  fashion,  denote  the  corresponding  residual  vector  by  e^-  y  -  Yftp. 
Then  e{ti ^/t^y}  ■  v'(o2In  +  v)-1*Ap* 

The  corollary  provides  valuable  insight  into  the  role  that  the  matrix  V* 
plays  in  predicting  Y^.  Th-  elements  of  this  matrix  reflect  the  prior 
covariance  assumed  for  the  bias  (or  model  inadequacy)  at  appropriate  pairs  of 
points  in  the  design  space.  Let  us  denote  the  covariance  function  over  all 
such  pairs  by  vCx^Xj).  Then  v'-(v(x,x1 ),....  fVtx.x^  ) .  Substituting  this 
into  the  second  term  of  (3.4)  yields: 

n 

(3.5)  e{y  /Y-y}  *  *'E{e  /X-y}  +  l  a  v(x,x  ), 
x  i  j-1 

where  the  coefficients  a^  are  estimated  from  the  data.  The  prediction 
equation,  as  a  function  of  x,  combines  the  approximating  polynomial  with  a 
linear  combination  of  n  functions  which  are  completely  determined  by  the  form 
of  the  covariance  function  and  the  choice  of  design  points. 

This  might  provide  useful  guidelines  for  choosing  the  covariance 
function,  since  some  choices  may  lead  to  especially  appealling  prediction 
equations  while  others  may  have  undesirable  consequences.  For  example.  Blight 
and  Ott  considered  the  special  case  of  univariate  polynomial  regression  and 
suggested  using  v(x,z)  “  p2X^x  ,  0<X<1,  which  is  the  covariance  function 
for  a  first  order  autoregressive  process.  It  can  be  seen  from  (3.5)  that  this 
results  in  a  prediction  equation  whose  derivative  is  discontinuous  at  each 
design  point.  If  it  is  believed  that  the  response  function  has  a  continuous 
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derivative,  then  this  covariance  function  may  be  a  poor  representation  of 

prior  belief.  Similarly,  Smith's  prior  specification  for  univariate 

2 

regression,  whereby  V  =  T  I,  is  called  into  question.  V  will  be  proportional 
to  the  identity,  in  general,  only  if  v(x,z)  decreases  rapidly  as  x  and  z 
become  distant  from  one  another.  But  then  the  n  functions  v(x,x^)  in  (3.5) 
will  each  have  a  rather  sharp  "spike"  around  x^.  The  resulting  prediction 
function  will  deviate  from  the  approximating  polynomial  only  in  the  vicinity 
of  the  design  points,  but  these  deviations  may  be  quite  sharp.  This  also 
would  seem  to  be  an  unlikely  summary  of  prior  belief  about  the  nature  of  the 
response  function. 

There  are  several  interesting  limiting  cases  of  the  above  formulas.  We 
have  already  observed  that  it  is  not  uncommon  to  assign  a  diffuse  prior  to  the 
regression  coefficients.  This  special  case  is  treated  by  Lindley  and  Smith 
[1972]  and  by  Blight  and  Ott.  It  is  also  the  situation  in  which  the  spline 
theory  can  be  related  to  these  models.  As  noted  by  Lindley  and  Smith,  the 
prior  is  made  diffuse  by  allowing  V  ^  to  tend  to  0.  The  following  theorem 
applies  the  results  of  Theorem  3.1  to  this  special  case. 

Theorem  3.2:  Suppose  a  diffuse  prior  is  assumed  for  the  regression 
coefficients.  Then  the  predicted  response  at  x  is: 

(3.6)  lim  e{y  /T=Tr}  =  r' [X,*"1X]"1*,ii1  y 

V-  ♦  0  x 

1  +  v'  {if1-  *r1X[X,*"1X]"V|T1}y, 

2 

where  M  *  (o  I  +  V).  The  estimated  regression  coefficients  are: 

(3.7)  lim  E{«2/Y=y}  =  [X'lf 

v1  +0 

Note  that  for  fixed  8^  the  sampling  distribution  for  Y  under  (3.1)  is 
2 

Y  **•  N(X8  ,a  I  +  V) .  The  maximum  likelihood  estimate  of  8_  under  this  model 
2  n  2 

is  precisely  the  estimate  given  in  (3.7).  Thus  the  well-known  correspondence 
between  maximum  likelihood  estimation  of  regression  parameters  and  Bayesian 


19 


« 


estimation  with  a  diffuse  prior  is  valid  for  these  models. 

As  with  Theorem  3.1,  the  above  equations  suggest  a  natural  decomposition 
of  the  prediction  equation  into  an  approximating  polynomial  plus  a  "correction 
for  bias"  functions 

Corollary  3.2.1s  An  alternative  expression  for  the  prediction  equation  when 
the  regression  parameters  are  assumed  to  have  a  diffuse  prior  iss 

(3.8)  lim  e{y  /Y=y}  =  s’lim  e{«  /T=y}  +  v'lT1  {i  -  XfX'M^X)  "Vm"1  }y. 

v“+o  *  v“+o  2 

Theorem  3.2  also  suggests  a  different  way  to  write  (3.6)  and  (3.7),  which 
directly  contrasts  the  importance  of  bias  and  experimental  error. 

Corollary  3.2.2s  Suppose  that  V*  is  a  scaled  version  of  a  "standardized" 

*  *  * 

covariance  matrix  R  ;  that  is,  V  =  TR  .  Denote  by  R,  r  and  r  the 

* 

partitioned  pieces  of  R  which  correspond  to  V,  ▼  and  v,  respectively,  in 

*  .2 

V  .  Let  X  =  a  /t.  Thens 

(3.9)  liiy  e{y  rt= y}  -  *•  (X’c“1X)"1X'c”1y 

V  4-0 

1  -1  -1  -1  -1  -1 

+  r'{C  -  C  X(X'C  X)  X'c  }y, 

where  C  =  Xl  +  R,  and 
n 

(3.10)  lim  e{8  /Y=y}  =  (X,C_1X)_1X,c“1y 

V~  +0 

Corollary  3.2.2  is  important  in  that  it  shows  the  contrasting  roles  of 

experimental  error  and  bias  in  determining  the  prediction  equation  (3.9)  and 

2 

the  estimated  regression  coefficients  (3.10):  they  depend  on  a  and  T  only 
through  their  ratio  X.  This  ratio  thus  provides  a  concise  summary  of  how 
"large"  or  "small"  are  the  elements  of  V.  Allowing  X  to  range  from  0  to 
infinity  permits  us  to  model  situations  from  those  in  which  the  bias  is 
totally  dominant  (as  might  be  the  case  in  numerical  analysis)  through  those  in 
which  the  experimental  error  is  dominant  (when,  say,  scientific  knowledge 
provides  an  exact  form  for  the  response  function).  This  parameter  corresponds 
directly  to  the  smoothing  parameter  in  the  spline  formulation  (2.8)  and  we 
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shall  see  in  Theorems  3.3  and  3.5  how  it  acts  as  a  smoothing  parameter  here. 

(An  essential  assumption  in  obtaining  this  result  is  that  the  regression 

coefficients  have  a  diffuse  prior;  if  a  proper  prior  is  used,  the  ratio  of 
2 

o  to  t  does  not  determine  the  conditional  expectations . ) 

An  important  conclusion  of  the  corollary  is  that  the  degree  to  which  a 
polynomial  approximation  P( x)  is  an  adequate  representation  of  a  respnse 
function  g(x)  depends  on  the  extent  of  experimental  error,  not  just  the 
absolute  bias,  g(x)  -  P(x).  Thus,  for  example,  even  if  the  response  function 
differs  considerably  from  the  form  of  the  approximating  polynomial, 
substantial  experimental  error  will  most  likely  make  it  impossible  to  detect 
this  and  our  estimate  of  the  response  function  should  not  deviate  greatly  from 
the  approximating  polynomial.  On  the  other  hand,  if  experimental  error  is 
very  small,  even  minor  departures  from  the  approximating  polynomial  may  be 
graphically  obvious  from  a  simple  plot  of  the  data  and  our  estimator  should  be 
modified  accordingly.  This  fundamental  observation  that  the  magnitude  of  bias 
must  necessarily  be  evaluated  relative  to  experimental  error  is  also  a  basic 
feature  of  the  design  criteria  developed  by  Box  and  Draper  [1959]. 

The  corollary  suggests  that  V*  be  written  as  proportional  to  a 

"standardized"  covariance  matrix.  Although  there  is  no  precise  way  of 

specifying  what  is  meant  by  "standardized",  we  can  offer  some  guidelines.  One 

natural  possibility,  if  the  diagonal  elements  of  ▼*  are  all  assumed  to  be 

equal  (as,  for  example,  in  Smith  and  in  Blight  and  Ott),  is  to  let  R*  be  the 

* 

correlation  matrix  which  corresponds  to  V  ;  both  of  the  preceding  papers  used 
precisely  such  a  parameterization.  When  the  bias  variance  is  thought  to 
depend  on  x,  which  might  be  appropriate  for  many  response  surface  situations, 
then  R*  might  be  defined  so  that  the  variance  for  some  specified  x  would  be 
equal  to  1.  Then  X  would  represent  the  ratio  of  experimental  error  to  bias 
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t 


at  that  particular  x. 


*  * 

Theorem  3.3:  Suppose  we  can  write  V  =  TR  ,  as  in  Corollary  3.2.2.  Then  we 
have  the  following  limiting  forms  as  X+“: 

(3.11)  lim  lim  e{y  /Y=y}  =  *'  (X'X)-1X'y. 

X+®  V  "'♦0  * 

(3.12)  lim  lim  e{8  /Y=y}  =  (X'xTVy. 

x+«  v~+o 

If  R  is  non-singular,  then  we  have  the  following  limits  as  X+0: 

(3.13)  lim  lim  e{y  /T=y}  =  x' (X'R_ ’x)" ’x'r" V 

X+0  V1  *°  *  -1  -1  -1  -1  -1 

+  r* {R  -  R  X(X’R  X)  X'R  }y,  and 

(3.14)  lim  lim  e{8  /Y=y}  =  (X'R~1X)'1X'R-1y. 

X+0  V^+0 

The  first  half  of  Theorem  3.3  yields  familiar  answers:  the  ordinary 
least  squares  estimators.  Thus,  ordinary  least  squares  obtains  as  a  limiting 
case  of  the  Bayesian  model  when  the  regression  parameters  have  a  diffuse  prior 
and  when  the  bias  is  assumed  to  be  negligible  relative  to  experimental  error 
(i.e.  when  the  approximating  polynomial  is  assumed  to  exactly  represent  the 
response  function).  The  second  half  of  Theorem  3.3  does  not  have  so  immediate 
an  interpretation.  However,  the  following  corollary  makes  clear  what  happens 
when  X  tends  to  0. 

A 

Corollary  3.3.1:  Denote  by  Y  the  nxl  vector  of  predicted  values 

A 

corresponding  to  the  n  design  points;  that  is,  Y  =  E {y  /Y*y}.  If  R  is  a 

*i 

non-singular  matrix,  then: 

A 

lim  lim  Y  =  y; 

X+0  V^+0 

i.e.  the  prediction  equation  interpolates  the  observed  data.  If  there  are 
replicate  observations  at  some  of  the  design  points,  these  will  contribute 
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identical  rows  to  R,  making  it  singular.  Although  the  corollary  cannot  then 
be  applied  directly,  it  is  easy  to  use  the  corollary  to  show  that  the  limiting 
prediction  curve  as  X>0  interpolates  the  averages  of  the  responses  observed 
at  each  design  point. 

If  we  think  of  the  prediction  equation  e{y ^/X* y}  as  a  function  of  X, 
we  see  that  it  varies  from  an  interpolant  (when  X»0 )  to  the  least  squares 
approximating  polynomial  (as  X  tends  to  infinity).  The  former  result  seems 
intuitively  reasonable,  for  X»0  describes  the  situation  in  which  there  is  no 
experimental  error,  so  that  the  observed  responses  are  exact  values  of  the 
response  function.  The  prediction  equation  reflects  this  certain  knowledge  by 
correctly  predicting  the  response  at  those  points. 

This  particular  characterization  of  the  prediction  equation  as  a  function 
of  X  is  well-known  in  the  spline  literature  (see,  for  example,  Kimeldorf  and 
Wahba  [1971]).  Blight  and  Ott  were  evidently  unaware  that  it  held  for  their 
model  as  well.  They  proposed  a  parametric  form  for  the  R  matrix,  and  then 
suggested  that  these  parameters  and  X  be  jointly  estimated  by  minimizing  the 

n  *  2 

residual  sum  of  squares,  S(R, X)  »  \  (y^  -  Y^)  .  It  is  clear  from  the 
corollary  that  this  will  always  be  minimized  when  X  “  0,  regardless  of  the 
values  of  the  other  parameters. 

We  can  also  state  some  additional  properties  of  the  predicted  value 

A 

vector  Y.  A  common  characterization  of  Bayes  estimates  is  that  they  can  be 
expressed  as  weighted  averages  of  their  prior  means  and  the  observed  data. 

This  is  not  possible  for  predicting  the  response  at  an  arbitrary  point  x  since 
in  general  no  observation  has  been  made  there;  however,  for  the  n  observed 
data  points  and  for  the  estimated  regression  coefficients,  we  can  write  such 
weighted  averages. 
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Theorem  3.4:  The  predicted  value  vector  Y  can  be  expressed  as  a  weighted 
average  of  its  prior  mean,  X0g  and  the  observed  responses,  y,  where  the 
weights  are  inversely  proportional  to  the  respective  measures  of  variation, 

V  +  XI^X*  and  0*  t 

(3.15)  y  -  [cf2i  +  (v  +  xv.x* )" V1  [<f2y  +  (▼  +  xy.x* )-1xftj. 

n  i  I  u 

Similarly,  the  estimated  regression  coefficients  can  be  written  as  a  weighted 
average  of  their  prior  mean  and  the  observed  responses,  with  the  weights 
inversely  proportional  to  the  prior  and  data  precision  matrices,  respectively: 

(3.16)  E{e2/Y«y}  =  [X'(02In+  V)_1X  +  V~  V’ [X' ( <J2In+  V)_1y  +  v”1^]. 

The  following  theorem  demonstrates  more  clearly  the  link  between  the 

Bayesian  models  and  the  statistical  smoothing  approach. 

A 

Theorem  3.5:  Y  solves  the  minimization  problem:  find  u  to  minimize 

(3.17)  (o-y)'(tt-y)  +  (m-X*0  )  •  [o2  (V  +  XI^X*  f1 )  (o-XPQ ) . 

If  V1  0,  (3.17)  becomes: 

(3.18)  (m-y)'(u-y)  +  Xu'  (if1-  r“1X(X'e_1X)"1X’*''1  )n, 

where  X  and  R  are  defined  as  in  Corollary  3.2.2.  Moreover,  the  second  term 
in  (3.18)  is  0  if  and  only  if  u  e  col(X),  that  is,  if  and  only  if  u  can  be 
written  as  a  linear  combination  of  the  columns  of  X. 

Both  (3.17)  and  (3.18)  characterize  the  prediction  vector  Y  as  the 

mm 

solution  to  a  minimization  problem  composed  of  two  terms:  the  residual  sum  of 

A  A 

squares/  (Y-y) '  (Y-y) ,  and  a  quadratic  penalty  term.  For  the  general  case, 

A 

(3.17),  the  quadratic  penalty  is  in  terms  of  the  distance  of  Y  from  its  prior 

expectation,  This  will  clearly  have  the  effect  of  "shrinking"  the 

vector  of  predicted  values  toward  the  prior  expectation.  The  extent  of  this 

2  -1 

shrinkage  depends  on  the  weighting  matrix  a  (▼  +  XY^X')  .  This  matrix  is 
2 

proportional  to  a  ,  but  inversely  proportional  to  the  prior  variance.  Thus 
the  prior  expectation  will  be  most  important  when  our  prior  precision  is  great 
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relative  to  experimental  error;  when  our  prior  precision  is  not  great,  the 
data  will  dominate  the  prior  in  determining  Y. 

The  quadratic  penalty  term  undergoes  several  interesting  changes  in  the 
limiting  case  of  (3.18).  First,  the  penalty  can  be  expressed  solely  in  terms 
of  the  "standardized"  bias  covariance  matrix  R  and  the  variance-bias  tradeoff 
parameter  X.  Second,  the  penalty  is  independent  of  the  prior  expectation. 
Third,  the  penalty  is  0  only  for  those  vectors  of  predicted  values  which  are 
in  the  column  space  of  X;  i.e.  for  those  vectors  of  predicted  values  which  can 
be  exactly  written  as  an  approximating  polynomial.  The  meaning  of  these  last 
two  points  is  that  the  penalty  does  not  induce  shrinkage  toward  a  particular 
pre-specified  vector;  rather,  in  a  more  general  sense,  there  is  shrinkage 
toward  the  response  plane  spanned  by  the  approximating  polynomial.  Finally, 
note  that  equation  (3.18)  is  an  exact  discrete  analogue  of  (2.8),  the 
continuous  smoothing  problem  which  leads  to  generalized  spline  estimation. 

Were  we  to  discretize  the  more  general  penalty  function  of  (2.8)  and  limit  it 
to  the  design  points,  we  would  obtain  an  equation  like  (3.18);  unfortunately, 
it  is  impossible  to  follow  this  path  in  the  reverse  direction  and  derive  a 
continuous  penalty  which  corresponds  to  a  discrete  one.  Nonetheless,  (3.18) 
illustrates  the  close  link  between  spline  estimation  and  the  Bayesian  models 
with  a  diffuse  prior  on  the  regression  coefficients. 

When  ®*  Theorem  3.5  allows  us  to  show  how  the  residual  sum  of 

squares  depends  on  the  choice  of  X.  Let  us  denote  the  vector  of  predicted 

A 

values  by  Y(X)  to  emphasize  its  dependence  on  X.  Then  define  the  residual 

A  A 

sum  of  squares  function  by  RSS(X)  =  (Y( X)  -  y] * [Y( X)  -  y]  . 

Corollary  3.5.1;  RSS(X)  is  a  monotone  increasing  function  of  X,  with 
RSS(X=«)  equal  to  the  residual  sum  of  squares  from  fitting  the  approximating 
polynomial  by  ordinary  least  squares  and,  provided  R  is  non-singular,  with 
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I 


RSS(O)  -  0. 

Additional  comments  on  the  type  of  estimates  produced  by  these  Bayesian 
models  will  be  found  in  section  7,  where  several  examples  are  presented. 
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4.  Inference  for  Future  Observations:  Precision 


Statisticians  are  rarely  satisfied  with  point  estimates  alone;  some 
measure  of  the  precision  of  the  estimates  is  also  essential.  Since  the 
Bayesian  models  base  their  estimates  on  posterior  distributions  of  the 
quantities  of  interest,  rather  them  on  sampling  properties,  it  is  natural  to 
obtain  measures  of  precision  from  the  posterior  distributions,  as  well.  Since 
the  scope  of  this  paper  is  limited  to  multivariate  normal  distributions,  the 
natural  measures  of  precision  are  the  appropriate  posterior  variance 
matrices.  These  will  be  presented  in  this  section. 

Theorem  4.1:  Let  Yx  denote  a  potential  future  observation  at  x.  Suppose  the 
joint  distribution  of  Yx  and  the  observed  response  vector  Y  is  given  by 

(3.1) .  Then  the  posterior  variance  of  Yx  ist 

(4.1)  Var  {y  j/Y-y}  =  o2  +  v  +  *' (X*  (o2In+  Y)_1X  +  V^V’x 

-  2w' (02I  +  V  +  XV.X' )_1XY.* 

nil 

-  v1 (02I  +  ▼  +  XV,X' )_1V. 

n  i 

The  posterior  variance  matrix  for  the  regression  coefficients  iss 

(4.2)  Var{«2/Y=y}  =  [X' (C2In+  V)_1X  +  v"1 ] _1 . 

The  posterior  variance  of  Yx  is  composed  of  several  components.  One  of 
2 

these  is,  of  course,  0  ,  since  any  future  observation  is  itself  subject  to 
observational  error.  Another  component  (the  last  term  on  the  first  line  of 

(4.1) )  is  clearly  seen  to  be  Var{*' 82/Y®y},  the  posterior  variance  of  the 
approximating  polynomial  part  of  Y^.  The  remaining  components  involve 
quadratic  forms  of  the  bias  covariances  and  the  regressor  variables,  but  do 
not  suggest  any  obvious  interpretation.  Both  of  the  posterior  variances  given 
above  depend  on  the  choice  of  the  experimental  design  (i.e.  on  X),  but  not  on 
the  observed  responses  y,  a  well-known  property  of  conditional  variances  for 
multivariate  normal  distributions. 
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As  in  the  last  section,  it  is  of  particular  interest  to  study  the  special 
case  when  the  regression  coefficients  are  assigned  a  diffuse  prior* 

Corollary  4.1.1:  Given  the  assumptions  of  Theorem  4. 1 : 

(4.3)  lim  Var {y  /Y-y}  -  a2  +  v  +  *’ (X'M-1*)-1* 

v"  -0  x  _1 

-  2t'«  X(X'II  X)  * 

-  v'tiT1-  m'1x(x,m-1x)"1x,m"1]t. 


where  H  *  o  1^  +  T.  Similarly: 


(4.4)  lijji  Var{e2/Y-y}  -  (X'«  X) 


V1  +0 

Once  again  the  use  of  a  diffuse  prior  leads  to  the  same  variance  matrix 
for  the  regression  coefficients  that  would  be  obtained  by  traditional  sampling 
theory  methods  for  this  model. 

The  formulas  of  Corollary  4.1.1  can  be  rewritten  in  a  standardized  form 

in  precisely  the  same  manner  as  the  estimates  of  the  preceding  section. 

Following  the  same  notation,  let  R  denote  the  standardized  bias  covariance 

2 

matrix,  with  V  *  tr,  and  let  X  -  o  /t  be  the  variance-bias  tradeoff 

2 

parameter.  Unlike  the  posterior  expectation  of  Y%,  which  depended  on  a  and 

T  only  through  X,  the  posterior  variance  proves  to  be  proportional  to 
2 

a  ,  with  the  constant  of  proportionality  a  function  of  only  X. 

Corollary  4.1.2:  Under  the  assumptions  of  Theorem  4.1: 

2,.  ,  ,-1  .  ,-1  _ -1-1 


V1  +0 


(4.5)  li^i  Var{y ^/Y-y}  -  0  {l  +  X  r  +  X  «' (X’C  X)' 

-  2X”1r,C-1X(X,C_1X)"1* 

-  X_1 r'  (C_1-  (f1X(X,c"1X)"1X,c"1]r}, 


where  c  *  Xl  +  R,  and: 
n 


(4.6)  liip  Var{62/Y*y}  -  02X_1  (R’C*1!)"1 . 


V1  +0 
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5.  Testing  For  Lack  Of  Fit 


The  Bayesian  models  studied  in  this  paper  have  been  designed  to  describe 
situations  in  which  an  unknown  response  function  is  approximated  by  a  simple 
graduating  function  such  as  a  polynomial.  The  models  seek  to  account  for 
this  approximate  nature,  whereas  traditional  analysis  typically  proceeds  as 
though  the  graduating  function  were  an  exact  representation  of  the  response 
function.  However,  traditional  analysis  is  not  blind  to  the  fact  that  a 
particular  graduating  function  may  be  inadequate  to  represent  a  complex 
response  function,  and  numerous  diagnostic  procedures  have  been  developed  to 
help  us  judge  model  inadequacy.  One  such  procedure  is  the  lack  of  fit  test. 
This  section  will  show  how  the  lack  of  fit  test  relates  to  the  Bayesian  models 
of  section  2. 

The  classical  lack  of  fit  test  depends  on  the  availibility  of  an 

2 

"independent"  estimate  of  the  error  variance  o  .  (By  independent,  we  mean  an 
estimate  which  is  independent  of  the  assumed  approximating  function.)  Such  an 
estimate  might  result  from  knowledge  of  the  experimental  situation,  past 
experience  or,  as  is  often  the  case,  from  the  inclusion  of  replicate 
observations  in  the  experiment.  The  extent  to  which  the  observed  responses 
vary  about  the  estimated  graduating  function  can  then  be  compared  to  this 
standard;  if  the  variation  is  excessive,  this  is  an  indication  that  the 
approximating  function  is  inadequate. 

Specifically,  suppose  the  approximating  model  (1.2)  is  exact,  so  that 

2  * 

Y  =  X6  +  e,  and  suppose  further  that  C  «■  N(0, a  I).  Let  Y  denote  the  vector 
of  predicted  responses  obtained  from  ordinary  least  squares  regression.  Then 
it  is  well  known  that  the  scaled  residual  sum  of  squares  is  distributed  as  a 
chi-squared  random  variable: 

(5.1)  RSS/O2  -  X^_p/ 
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A  A 

where  RSS  »  (Y  -  Y)  *  (Y  -  Y)  is  the  residual  sum  of  squares,  n  is  the  number 

of  observations  and  p  is  the  rank  of  Z.  The  lack  of  fit  test  thus  compares 

2 

the  observed  RSS,  scaled  by  the  prior  estimate  of  o  ,  with  percentage  points 

2 

of  the  corresponding  chi-squared  distribution.  When  the  estimate  of  a  is 
itself  obtained  from  the  experiment  via  replicate  observations,  the  analogous 
procedure  is  to  divide  the  RSS  into  two  components: 

(5.2)  RSS  =  SS(Pure  Error)  +  SS(Lack  of  Fit), 
where  SS(Pure  Error)  is  the  replication  sum  of  squares  and  SS(Lack  of  Fit)  is 
the  sum  of  suared  deviations  of  the  replicate  averages  about  the  estimated 
graduating  function.  These  terms  are  divided  by  their  appropriate  degrees  of 
freedom,  and  their  ratio  then  has  an  F  distribution  under  the  assumption  that 
the  approximating  model  is  correct. 

The  Bayesian  models  described  in  section  2  account  for  the  approximate 
nature  of  the  graduating  function  by  adding  an  extra  term.  In  section  3,  we 
saw  that  when  these  models  assign  a  diffuse  prior  to  the  regression 
coefficients,  the  extent  to  which  the  added  term  influences  the  estimates  is 
controlled  by  the  single  parameter  X,  the  ratio  of  experimental  error  to 
bias.  In  particular,  the  limiting  case  of  X  =*  «•  reduces  to  the  traditional 
model  in  which  the  graduating  function  is  assumed  to  exactly  represent  the 
response.  Thus  it  seems  reasonable  that  we  should  be  able  to  generalize  the 
lack  of  fit  test  to  the  Bayesian  models  as  a  test  regarding  X  in  such  a  way 
that  the  test  described  above  results  in  the  limit  as  X  tends  to  •». 

A  general  method  for  deriving  diagnostic  checks  of  parameters  in  Bayesian 
models  has  been  suggested  by  Box  [1980],  He  advocates  examining  the 
consistency  of  the  observed  data  vector  Y  with  the  predictive  distribution 
implied  for  Y  by  the  model.  Of  course,  the  predictive  distribution  of  Y  is 
simply  the  marginal  distribution  given  in  Lemma  3.1  of  the  appendix: 
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(5.3a)  T  -  N(«0,a2I  +  V  +  X^X' ). 

Using  the  reparameterization  suggested  by  the  standardized  form  of  Corollary 

-1  2 

3.2.2,  let  us  replace  V  by  X  a  R  to  obtain: 

(5.3b)  T  -  N(X»0,O2I  +  X-102R  +  XV^*). 

A  logical  diagnostic  checking  function  based  on  the  predictive  distribution  is 
the  sum  of  squared  deviations  from  the  mean  vector  weighted  by  the  inverse  of 
the  covariance  matrix: 

(5.4)  h(X)  -  (T  -  Xe0)*<<J2I  +  X-VR  +  XX1X,)”1(T  -  XpQ). 

If  the  model  is  true,  then  h(X)  should  be  distributed  as  a  chi-squared 

random  variable  with  n  degrees  of  freedom.  Assuming  the  other  parameters  in 

2 

the  model  are  exactly  specified  (including  a  ),  this  can  be  used  to  check 
whether  an  hypothesized  value  for  X  is  consistent  with  the  data. 

We  have  already  remarked  that  considerable  interest  focuses  on  the 
special  case  when  the  regression  parameters  are  assigned  a  diffuse  prior.  We 
now  examine  how  this  affects  the  diagnostic  checking  procedure  described 
above.  To  do  so,  we  must  first  establish  some  simple  results  about  the 
residual  vector  which  results  when  a  diffuse  prior  is  used.  Proofs  for  all  of 
the  following  propositions  are  given  in  the  appendix. 

-1  2 

Lemma  5. 1 :  Suppose  T  follows  the  model  described  by  (2.1),  with  V  -  X  a  R. 

A 

Let  Y( X)  denote  the  vector  of  predicted  values  when  the  regression 

A 

coefficients  are  assigned  a  diffuse  prior,  and  let  •( X)  denote  the 

A  A 

corresponding  residual  vector;  i.e.  e(X)  ■  Y  -  Y(X).  Then: 

(5.5)  i(X)  -  B(X)Y, 

where  B(X)  -  X(C-1  -  c’1X(X'c”1X)"1X,c“1]  and  C  -  XI  +  R. 

n 

* 

Theorem  5.1:  Let  us  denote  by  h  (X)  the  special  case  of  h(  X)  which 
obtains  when  the  regression  parameters  have  a  diffuse  prior;  i.e. 
h*(X)  -  lif  h ( X ) .  Then: 
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(5.6) 


h*(X)  -  [B(  X)Y]  '  (I  +  X_1R)[B(X)T]/a2. 
n 

In  the  limit/  as  X  ♦  •,  we  have: 

(5.7)  h*(X)  -  RSS/O2, 

where  RSS  denotes  the  residual  sum  of  squares  which  results  from  ordinary 
least  squares  regression. 

The  limiting  case  given  in  (5.7)  above  is  precisely  the  traditional  lack 

2 

of  fit  statistic  suggested  in  (5.1)  for  situations  in  which  o  is  assumed  to 
be  known.  Thus,  the  predictive  check  described  above  does  provide  a 
generalization  of  the  traditional  lack  of  fit  test  when  the  regression 
coefficients  are  assigned  a  diffuse  prior.  For  general  X  the  diagnostic 

statistic  given  in  (5.6)  is  a  weighted,  scaled,  residual  sum  of  squares,  where 

-1  -I  2 

the  weighting  matrix  is  1^  +  X  R  “  X  C  and  the  scale  factor  is  o  . 

An  interesting  question  arises  concerning  the  appropriate  reference 

distribution  with  which  to  compare  the  diagnostic  statistic  (5.6).  We 

obtained  (5.6)  as  a  limiting  case  of  (5.4),  which  we  argued  earlier  should 

have  a  chi-squared  distribution  with  n  degrees  of  freedom.  However,  when  we 

examined  the  limiting  case  of  (5.6)  which  corresponds  to  perfect  faith  in  the 

approximating  model,  we  obtained  (5.7),  the  traditional  lack  of  fit  statistic, 

which  has  a  sampling  distribution  which  is  chi-squared  with  n-p  degrees  of 

freedom,  where  p  is  the  rank  of  X.  What  happened  to  the  p  degrees  of  freedom, 

and  which,  if  either,  of  these  distributions  is  an  appropriate  reference 

distribution  for  (5.6)? 

It  seems  clear  that  of  the  two  limiting  processes  which  lead  from  (5.4) 
to  (5.7),  the  more  important  is  the  use  of  the  improper  prior  for  the 
regression  coefficients.  This  affects  p  dimensions  of  the  prior  distribution, 
precisely  the  number  of  degrees  of  freedom  lost.  We  might  then  think  of  the 
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improper  prior  as  forcing  us  to  replace  those  p  dimensions  with  information 

from  the  data,  at  the  expense  of  information  about  the  residuals.  The  results 

of  Theorem  3.5  are  also  germane  here.  Theorem  3.5  characterized  Y  as  a 

shrinkage  type  estimate.  When  the  regression  coefficients  were  assigned  a 

proper  prior,  the  shrinkage  was  in  the  direction  of  the  prior  mean, 

however,  when  an  improper  prior  was  used,  the  shrinkage  was  in  the 

direction  of  the  p-dimensional  regression  plane,  rather  than  toward  a  specific 

vector  in  that  plane.  Again,  this  suggests  that  using  a  data  estimated 

Y  rather  than  a  prior  mean  vector  reduces  the  dimension  of  the  space  in  which 

the  residuals  lie,  resulting  in  a  loss  of  p  degrees  of  freedom.  Thus, 

although  it  is  not  formally  true  that  h  (A)  has  a  probability  distribution 

(an  inevitable  consequence  of  using  a  vague  prior  for  the  regression 

2 

coefficients),  we  feel  that  X^_p  is  an  appropriate  reference  distribution  for 
the  diagnostic  statistic  (5.6)  which  can  be  used  to  check  any  hypothesized 
value  of  A. 

Lemma  5.2:  Consider  the  weighted  residual  sum  of  squares  which  appears  in 
(5.6),  but  with  the  vector  of  predicted  values,  and  hence  the  residual  vector, 
determined  by  a  proper  prior.  For  any  As 

li?p  E{(Y  -  Y)'(In  +  A-1R)(Y  -  Y)}  -  (n-p)d2. 


* 

The  above  expression  is  not  the  expected  value  of  h  ( A)  since  we  have 

* 

reversed  the  limit  and  the  expectation.  (As  noted  above,  formally,  h  (A) 

does  not  have  a  probability  distribution,  and  hence  has  no  expectation.) 

Nonetheless,  the  lemma  does  provide  additional  support  for  the  claim  that  the 

2 

proper  reference  distribution  for  (5.6)  is  • 

Box  (1980]  found  that  many  of  the  predictive  checks  for  standard  models 
were  identical  to  hypothesis  tests  derived  from  corresponding  sampling  theory 
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models.  This  is  also  true  of  the  current  diagnostic  statistic.  An 

appropriate  sampling  theory  model  for  (2.1)  can  be  derived  by  viewing  the 

distribution  of  Y  given  a  fixed  value  for  the  regression  vector  8^ ,  rather 

than  averaged  over  a  prior  distribution  for  Bj.  The  resulting  model  is  a 

standard  linear  model,  but  with  correlated  errors: 

(5.8)  Y  **•  N(X8  ,  <j2I  +  X-1(j2R). 

2  n 

Theorem  5.2:  Given  the  sampling  model  (5.8)  and  assuming  that  o'  is  known, 

the  appropriate  lack  of  fit  statistic  for  the  model  e{y}  *  XB^  is  (5.6). 

2 

The  sampling  theory  distribution  of  (5.6)  is  xjj  • 

2 

Thus  Theorem  5.2  provides  still  further  support  that  is  a*1  appropriate 

reference  distribution  for  (5.6). 

The  theory  presented  thus  far  might  be  used  to  test  whether  a  particular 
value  of  X  is  consistent  with  the  observed  data.  It  is  common  practice  to 
apply  reverse  logic  to  such  procedures  and  to  consider  the  collection  of 
X's  which  are  consistent  with  the  data,  within  given  quantile  bounds  of  the 
reference  distribution.  Traditionally,  this  is  advocated  to  construct 
confidence  sets  from  hypothesis  tests.  Whether  or  not  we  choose  to  regard  the 
resulting  collection  of  X's  as  a  confidence  set,  the  exercise  is  instructive 
as  to  the  nature  of  the  proposed  predictive  check.  Theorem  5.2  can  be  used  to 
characterize  such  a  set,  via  the  following  corollary. 

Corollary  5.2.1:  h  (X)  is  a  monotone  increasing  function  of  X. 

2 

The  corollary  makes  it  clear  that  for  any  given  quantiles  from  •  the 

* 

set  of  X’s  for  which  h  (X)  lies  within  the  quantiles  will  be  an  interval 
(possibly  infinite).  This  is  intuitively  reasonable.  Moreover,  we  see  that 
some  choices  of  X  may  be  rejected  because  the  diagnostic  statistic  is  too 
large,  while  other  choices  may  be  deemed  inconsistent  with  the  data  because 


the  statistic  is  too  small  to  reasonably  have  a  xn_  distribution.  This 
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differs  from  the  traditional  lack  of  fit  test,  which  rejects  only  in  those 
instances  where  the  statistic  is  too  large.  This  is  a  consequence  of  the 
explicit  representation  of  bias  in  the  Bayesian  models:  the  traditional  test 
can  tell  us  only  that  the  approximating  model  is  inadequate,  while  the 
predictive  test  tells  us  whether  our  representation  of  the  inadequacy  is 
insufficient,  excessive  or  reasonable. 

The  predictive  check  (5.6)  can  also  be  extended  to  problems  for  which 

2 

a  is  unknown  but  can  be  independently  estimated.  Typically  this  occurs  when 
replicate  observations  have  been  made  at  some  of  the  design  points  —  since 
replicate  observations  have  the  same  expected  value,  any  differences  among 
them  can  be  attributed  solely  to  experimental  error.  This  leads  to  the 
decomposition  of  the  residual  sum  of  squares  into  "pure  error"  and  "lack  of 
fit"  components  given  in  (5.2).  If  m(i)  observations,  Y^ i , • • • *Yi,m( i) »  have 
been  made  at  x^,  i**1,2,...,k,  then  the  pure  error  sum  of  squares  is  given  by: 

k  m(i)  _ 

(5.9)  SS(Pure  Error)  -  £  £  (Y  -  Y  )  , 

11  1,3  1 

where  Y^  is  the  average  of  the  m(i)  observations  at  x^ .  The  lack  of  fit  sum 
of  squares  can  now  be  obtained  by  subtraction.  Alternatively,  we  can 
calculate  the  lack  of  fit  sum  of  squares  by  replacing  each  component  of  the 
response  vector  Y  by  the  average  of  all  the  replicate  observations  at  that 
point  and  then  comparing  this  vector  to  the  predicted  value  vector. 
Specifically,  let  Y  be  the  vector  generated  from  Y  by  replacing  Y^  j  by  Y^. 

Then  the  lack  of  fit  sum  of  squares  can  be  written: 

A  A 

(5.10)  SS(Lack  Of  Fit)  -  (Y  -  Y)'(Y  -  Y). 

It  is  easy  to  show  that  (5.9)  and  (5.10)  satisfy  (5.2). 

* 

The  diagnostic  function  h  (X)  can  also  be  decomposed  into  "pure  error" 
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and  "lack  of  fit"  components  when  replicate  observations  are  available.  We 

define  SS(Pure  Error)  exactly  as  in  (5.9),  while  SS(Lack  of  Fit)  is  found  by 

an  approach  analogous  to  that  which  led  to  (5.10).  As  above,  let  Y  be  a 

vector  generated  from  Y  by  replacing  each  observed  response  by  the  average  of 

all  the  responses  observed  at  the  same  design  point.  The  lack  of  fit  sum  of 

squares  is  then  calculated  from  (5.6),  but  with  Y  replacing  Y: 

(5.11)  SS(Lack  of  Fit)  =  [B(X)Y]'(I  +  X_1 R) [B( X)Y] . 

n 

It  is  easy  to  show  that  the  sum  of  (5.9)  and  (5.11)  is  the  weighted  residual 

sum  of  squares  in  the  numerator  of  (5.6). 

From  a  sampling  theory  point  of  view,  we  can  now  test  the  adequacy  of  the 

model  by  comparing  the  ratios 

[SS(Lack  of  Fit)/(k-p) ] /{SS (Pure  Error )/(n-k ) ] 

to  quantiles  of  an  F  distribution  with  k-p  and  n-k  degrees  of  freedom.  We  can 

also  justify  this  procedure  from  a  Bayesian  standpoint  by  considering  the 

2 

distribution  of  this  ratio  conditional  on  0  and  then  assigning  a  uniform 

improper  prior  distribution  to  log( 0) .  If  we  assume  that  the  distribution  of 

*  2  2 
h  (X),  conditional  on  o  ,  is  actually  x^p*  then  integrating  out 

2 

o  results  in  the  F  distribution  above. 

* 

Although  we  have  emphasized  the  use  of  h  ( X)  to  check  the  consistency  of 
X  with  the  data,  it  is  important  to  keep  in  mind  that  it  is  the  entire  model 
specification  that  is  being  checked,  not  just  the  particular  value  of  X. 

Thus,  the  previous  discussion  really  must  be  read  as  conditional  on  the 
assumption  that  we  are  quite  certain  of  the  best  choice  of  an  approximating 
polynomial  and  of  the  form  of  R;  only  the  relative  importance  of  bias  to 
experimental  error  is  really  at  question.  If  this  is  the  case,  then  an 
abnormally  large  or  small  value  of  the  diagnostic  statistic  can  legitimately 
be  interpreted  as  reflecting  an  inappropriate  choice  of  X,  and  we  may  seek  to 
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improve  the  model  by  changing  X. 

Often,  there  will  be  several  aspects  of  the  model  about  which  we  are 
unsure.  Since  the  diagnostic  statistic  (5.6)  is  an  omnibus  check  of  the 
model,  it  will  not  point  to  any  one  of  these  features  as  inadequate.  Rather, 
we  must  use  our  judgment  in  determining  which  aspects  of  the  model  should  be 
changed  in  order  to  make  it  consistent  with  the  data.  Sometimes,  a  more 
flexible  approximating  polynomial  may  suffice  to  render  the  bias  correction 
insignificant;  this,  of  course,  is  the  corrective  action  almost  invariably 
used  when  the  traditional  lack  of  fit  test  is  applied.  Graphical  techniques 
will  often  suggest  when  this  is  desirable.  In  other  cases,  revising  our 
assumptions  about  the  smoothness  of  the  response  function  as  reflected  in  R 
might  be  of  great  help.  However,  the  predictive  check  can  be  most  easily 
expressed  in  terms  of  its  relation  to  X  and,  in  general,  we  feel  it  should 
be  interpreted  as  an  indication  of  the  adequacy  of  X. 
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6.  A  Kalman  Filter  Version 

The  Kalman  filter  is  a  system  of  stochastic  equations  used  primarily  by 
control  engineers  to  describe  processes  that  evolve  over  time  in  a  structured, 
but  non-deterministic  fashion.  The  primary  objective  of  the  engineers  is  to 
study  the  properties  of  different  strategies  which  might  be  used  to  keep  the 
process  under  control.  However,  the  Kalman  filter  itself  is  quite  general  and 
may  be  used  to  model  a  wide  variety  of  statistical  problems.  A  recent  paper 
by  Harrison  and  Stevens  [1976]  has  suggested  how  the  Kalman  filter  can  be 
applied  to  the  problem  posed  here  of  studying  how  a  response  variable  depends 
on  several  inputs  when  an  exact  form  for  the  response  function  is  unknown. 

This  section  will  show  how  the  models  of  section  2  can  be  written  as  a  Kalman 
filter;  however,  we  will  argue  that  the  chief  advantage  of  this  model, 

Kalman's  recursive  estimation  procedure,  cannot  be  applied. 

We  will  follow  the  general  set-up  of  Harrison  and  Stevens.  They  proposed 
a  "dynamic  linear  model": 

(6.1a)  »t  =  Ffcet  +  ufc/  ufc  ~  N(0,Ot> 

(6.1b)  6_  =  G8t_1  +  wfc,  wt  ~  N(0,Wt). 

The  first  equation,  known  to  control  engineers  as  the  observation  equation, 
relates  a  vector  of  observed  variables,  Tfc,  to  a  known  matrix  Ft  of  regressor 
variables  via  a  linear  model  plus  random  noise.  The  second,  or  system, 
equation  relates  how  the  parameters  themselves  may  change  from  one  observation 
to  the  next  (herein  lies  the  "dynamic"  element  of  the  model).  It  is  assumed 
that  the  current  parameter  values  are  a  known  linear  transformation  G  of  the 
previous  values  perturbed  by  the  random  error  Initial  parameter 

estimates,  8Q,  are  needed  to  start  the  process.  Typically,  rather  than 
specify  exact  values,  a  prior  distribution  is  provided  for  8Q,  so  that  the 
model  has  a  distinctly  Bayesian  character. 


Harrison  and  Stevens  proposed  this  model  with  time  series  data  in  mind, 
so  they  naturally  interpreted  the  index  t  as  denoting  the  sequential  order  of 
the  observations.  However,  the  model  is  not  restricted  to  time  dependent 
data,  Xn  general,  one  can  think  of  t  as  simply  an  arbitrary  parameter  which 
labels  the  different  observations  (perhaps  the  order  in  which  experimental 
runs  were  made,  or  the  "standard"  ordering  for  the  experimental  design 
used).  The  matrix  Ft  then  corresponds  only  to  th“  values  of  the  regressor 
variables  for  the  t'th  observation,  not  the  entire  X  matrix  of  the  previous 
sections. 

The  rationale  for  ordering  the  observations  is  so  that  we  may  use 
Kalman's  recursive  estimation  algorithm.  The  algorithm  provides  a  simple  but 
powerful  way  to  calculate  the  posterior  distribution  of  given  y1 , , , . , ,yt 
and  F1,....,Ft.  Specifically,  if  the  prior  distribution  of  is  N(b^,Cq), 
then: 


et/yr*,,,yt'pr,‘*'Ft  ~  N(mt,ct)' 


(6.2) 

where  and  are  obtained  recursively  from  the  following  set  of 
equations .  Let : 

y 

e  ”  y  -  y. 


(6.3a) 

(6.3b) 

(6.3c) 

(6.3d) 

(6.3e) 

Then: 

(6.4a) 

(6.4b) 


V*t-1' 


't 
R  =  GC 


S 

A 


t-iG'  +  V 


w  +  V 


RF^S 


-1 


t-1 


+  Aa, 


Cfc  =  R  -  ASA' . 


Note  that  if  the  t'th  observation,  yt,  is  p- dimensional,  then  equations  (6.3) 
and  (6.4)  provide  a  technique  for  calculating  the  posterior  distribution  of 
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the  parameter  vector  which  never  requires  inversion  of  matrices  larger 
than  pxp,  regardless  of  the  sample  size.  In  particular,  if  y t  is  a  scalar, 
then  no  matrix  inversion  is  required  at  all.  This  special  case  was  remarked 
by  Plackett  [1950J  as  a  simple  device  for  updating  regression  estimates  when  a 
new  data  point  is  obtained. 

Many  common  statistical  models^  can  be  written  in  the  form  of  (6.1).  For 
example,  standard  linear  regression  can  be  described  by  taking  6  to  be  the 
identity  matrix  and  requiring  that  be  identically  0  for  each  tj  that  is, 
each  succesive  observation  depends  in  exactly  the  same  way  on  the  regressors. 
Harrison  and  Stevens  present  numerous  additional  examples  in  their  paper.  In 
particular,  they  remark  that  "hierarchical  models,  such  as  those  of  Lindley 
and  Smith  [1972],  can  also  be  formulated  as  dynamic  linear  models".  The 
Kalman  filter  approach  is  clearly  evident  in  O'Hagan's  [1978]  localized 
regression  model,  which  postulates  a  regression  model  in  which  the  parameters 
change  regularly  from  one  point  in  the  design  space  to  another  (see  in 
particular  the  comments  by  Priestley  and  Titterington) . 

We  will  now  show  how  Smith's  model  (2.1),  itself  a  special  case  of  the 
models  proposed  by  Lindley  and  Smith,  can  be  written  as  a  dynamic  linear 
model.  (However,  we  will  use  the  equivalent  formulation  of  Blight  and  Ott 
((2.3)  and  (2.4))  rather  than  (2.1)  in  order  to  avoid  confusion  with  respect 
to  the  indices  on  8  —  in  (2.1)  the  indices  1  and  2  are  used  to  denote 
parameters  at  the  first  and  second  tiers  of  the  model,  respectively,  while  in 
(6.1)  the  same  indices  are  used  to  denote  the  regression  parameters  for  the 
first  and  second  observations,  respectively.) 

Recall  from  (2.3)  that  the  vector  of  observed  responses  follows  the 
model :  t  =  xb  +  n  +  e.  Let  t  be  an  arbitrary  label,  as  suggested  above. 

Then  the  model  equation  for  the  t'th  observation  is: 
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•  I 

(6.5a)  Y  -  *t$  +  nt  +  et  -  *te  +  Nfc, 

where  Nfc  -  n  +  efc.  Since  we  wish  to  describe  (2.3)  on  an  observation  by 

observation  basis ,  is  a  scalar,  rather  than  the  more  general  vector 

observation  allowed  in  (6.1).  From  the  distributional  assumptions  for  the 

Blight  and  Ott  model  (2.4),  we  conclude  that  has  a  normal  distribution  with 

o 

mean  0  and  variance  c  +  1  ,  where  1*.  is  the  appropriate  diagonal  entry  from 
•  t  c 

L  \  Thus,  (6.5a)  gives  the  observation  equation  for  the  model. 

The  system  equation  which  corresponds  to  (2.3)  is  derived  in  the  same 
manner  mentioned  above  for  a  conventional  linear  regression  model.  It  is 
assumed  in  (2.3)  that  the  regression  parameters  remain  the  same  for  each 
observation.  This  implies  the  system  equation: 

(6.5b)  3t  -  Bt_1# 

where  Gt=I  and  wfc“0  for  each  t.  The  prior  distribution  for  0  given  in  (2.4) 
can  then  be  introduced  by  assuming  that  it  is  the  prior  distribution  for 
Since  the  system  equation  implies  that  . .  .*®n,  the  same  prior 

distribution  will  be  in  effect  for  each  observation.  Thus  for  each 
observation,  the  Blight-Ott  model,  and  hence  the  Smith  model,  can  be  written 
as  a  dynamic  linear  model. 

The  great  advantage  of  the  Kalman  filter  model  is  the  existence  of  the 
recursive  estimation  algorithm  (6. 2)- (6. 4)  which  provides  a  simple  method  to 
calculate  all  the  relevant  posterior  distributions  without  having  to  invert 
large  matrices.  Unfortunately,  the  Kalman  recursion  relations  are  not  valid 
for  (6.5).  The  reason  is  an  important  assumption,  unstated  by  Harrison  and 
Stevens  in  either  (6.1)  or  (6.2),  that  the  random  shocks  occurring  at 
different  time  points  must  be  uncorrelatedi  i.e.,  that  if  t*t',  then  (ufc,wt) 
must  be  uncorrelated  with  (u^,,wti).  However,  we  found  in  section  3  that  the 
assumption  that  u^  and  u^.  are  uncorrelated  for  distinct  t,t'  leads  to  a 
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discontinuous  estimator.  Thus  the  assumption  that  they  are  in  fact  correlated 
is  a  central  feature  of  the  model.  The  algorithm  can  be  extended  to  cover 
this  case  (see  Harrison  and  Stevens'  reply  to  Godolphin's  comment),  but  the 
extension  requires  the  inversion  of  a  jxj  matrix  at  the  j'th  step  in  order  to 
obtain  the  conditional  distribution  of  v^  given  v1,...,v^_^.  Moreover,  this 
provides  only  the  relevant  posterior  distributions  for  the  observed  sequence 
of  data  points;  in  order  to  draw  predictive  inferences  about  possible  future 
observations,  it  is  still  necessary  to  invert  an  nxn  matrix. 

The  Kalman  filter  representation  provides  an  interesting  alternative 
statement  of  the  models,  emphasizing  the  similarity  of  the  Bayesian  linear 
model  and  control  theory  models.  However,  unless  the  response  function  is 
naturally  observed  as  a  time  series,  it  is  questionable  whether  the  Kalman 
filter  representation  provides  much  additional  insight;  it  seems,  in  fact,  to 
be  a  rather  awkward  way  to  think  of  general  response  surface  models. 
Nonetheless,  the  representation  would  certainly  be  very  useful  if  the 
computational  efficiency  of  Kalman's  recursive  algorithm  were  applicable;  it 
is  unfortunate  that  this  is  not  so. 


i 
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7.  Examples 


In  this  section  we  will  illustrate  the  results  of  the  previous  sections 
by  using  the  Bayesian  models  to  analyze  two  numerical  examples.  Both  examples 
involve  the  response  to  a  single  input,  and  in  both  cases  the  response  is 
essentially  a  linear  function  of  the  input  over  the  range  the  observations. 
Thus,  the  straight  line  approximating  model  (1.1a)  will  be  used,  with  the 
input  as  the  only  explanatory  variable.  We  will  assign  a  diffuse  prior 
distribution  to  the  regression  parameters,  so  that  the  "standardized"  notation 
of  Corollary  3.2.2  is  appropriate.  All  of  the  calculations  described  in  this 
section  were  performed  on  the  WIRCS/VAX  computer.  The  Bayesian  estimates  were 
calculated  using  the  MATLAB  matrix  laboratory  package  (Holer,  [1981])  while 
the  standard  regression  analyses  were  carried  out  using  the  MINITAB 
statistical  package  (Ryan,  et.  al.  [1981]). 

The  first  example  uses  data  reported  by  Draper  and  Smith  [1981]  from  an 
experiment  to  investigate  the  relationship  between  the  yield  of  a  chemical 
process  and  the  reaction  temperature  (in  coded  units): 

Temp  -5  -4  -3  -2  -1  0  1  2  3  4  5 

Yield  1  5  4  7  10  8  9  13  14  13  18 

A  casual  inspection  of  the  data  is  sufficient  to  conclude  that  yield  generally 
increases  as  a  function  of  temperature.  The  increase  across  the  range  of 
temperatures  in  the  experiment  seems  to  be  more  or  less  linear,  although  there 
is  a  fair  amount  of  variation  from  one  observation  to  the  next,  suggesting 
either  that  experimental  error  may  be  fairly  substantial,  or  that  some 
additional  explanatory  variable (s)  which  was  not  included  in  the  experiment  is 
important  in  determining  yield. 

We  propose  to  model  these  data  by  (2.3): 
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T  -  X0  +  U  +  C 


where  T  is  the  vector  of  observed  yields,  X  is  the  design  matrix  which 
corresponds  to  a  linear  approximating  model  in  temperature,  n  is  the  vector 
of  bias  terms  and  c  is  the  vector  of  experimental  errors.  We  will  assume 

that  the  autocorrelation  function  of  the  bias  process  is: 

2 

(7.1)  R(x1,x2)  ■  exp[-(x1  -  x2)  /s], 

where  s  is  a  parameter  which  we  will  specify.  This  is  a  legitimate 
autocorrelation  function  (by  Bochner's  Theorem  --  see  Chung  [1974],  Chapter  6) 
and  corresponds  to  a  continuous,  second  order,  stationary  Gaussian  stochastic 
process.  Note  that  R(x,x)  °  1  for  all  real  x;  thus,  we  are  assuming  in  (7.1) 
that  the  probable  extent  of  the  bias  is  the  same  at  all  temperatures.  The 
choice  of  s  controls  how  rapidly  the  correlation  drops  off  between  different 
temperatures.  If  s  is  quite  small,  the  decay  is  very  rapid;  this  corresponds 
to  a  prior  belief  that  the  relationship  of  yield  to  temperature  may  have  sharp 
local  deviations  from  the  overall  linear  approximation.  On  the  other  hand,  if 
s  is  large,  so  that  there  is  a  high  correlation  between  the  bias  terms  for 
rather  different  temperatures,  then  (7.1)  will  reflect  prior  belief  that  the 
response  does  not  have  any  wild  local  oscillations.  Thus  the  choice  of  s  is 
closely  linked  to  the  experimenter's  prior  beliefs  about  how  smoothly  yield 
responds  to  temperature. 

The  autocorrelation  function  in  (7.1)  is  very  similar  to  that  proposed  by 

I  1 

Blight  and  Ott:  R(x,z)  ■  X  ',  0<X<1.  The  only  difference  is  the  use  of 

the  squared  difference  rather  than  the  absolute  difference  between  the 
respective  points.  However,  a  possibly  important  consequence  of  this  change 
is  that  it  results  in  a  prediction  equation  which  is  an  analytic  function  of 
temperature  rather  than  a  function  with  a  discontinuous  first  derivative,  as 
results  from  the  Blight-Ott  suggestion. 


44 


Two  different  values  were  chosen  for  each  of  the  parameters  associated 
with  the  bias  covariance  function.  The  parameter  s  in  the  correlation 
function  was  set  at  1  and  16,  so  that  the  correlation  decays  to  1/e  for 
temperatures  which  are  1  or  4  coded  units  apart,  respectively.  The 
experimental  error  to  bias  parameter  X  was  set  at  1  and  0.1,  reflecting 
situations  in  which  bias  is  of  the  same  magnitude  or  ten  times  as  severe  as 
experimental  error,  respectively.  Using  each  combination  of  these  parameter 
settings  gave  four  different  prediction  equations.  A  fifth  prediction 
equation  was  derived  by  using  ordinary  least  squares.  The  parameter  settings 
were  chosen  merely  to  illustrate  the  flexibility  of  the  Bayesian  models  and  to 
illustrate  their  application;  they  do  not  reflect  any  experimenter's  prior 
assumptions  about  the  relationship  between  temperature  and  yield. 

The  four  prediction  functions  described  above  are  graphed,  along  with  the 
OLS  prediction  function,  in  the  four  panels  of  Figure  1;  each  graph  also 
displays  the  observed  data  points.  Overall,  the  five  functions  are  very 
similar  and  are  clearly  dominated  by  a  positive  linear  trend.  Over  the  range 
of  values  graphed,  these  functions  never  differ  by  more  than  a  unit  of 
yield.  In  particular,  an  analyst  using  OLS  regression  would  reach  roughly  the 
same  conclusions  about  the  relationship  between  yield  and  temperature  as  a 
colleague  using  any  of  the  suggested  Bayesian  models. 

The  differences  among  the  prediction  functions  concern  primarily  the 
manner  in  which  they  vary  about  the  linear  trend.  The  most  noticeable 
differences  occur  when  s  is  altered  from  1  to  16,  reflecting  prior  belief  in  a 
smoother  relationship.  The  prediction  functions  bear  convincing  witness  to 
the  effect  of  such  a  prior  belief:  the  two  graphs  with  s*1  display 
considerable  local  variation,  with  fairly  sharp  increases  in  yield  followed  by 
plateaus  and  even  by  decreases.  Although  these  prediction  functions  follow  an 
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Figure  1(d) i  The  Bayesian  prediction  function  for  s*16,  X»0.1  (solid  line)  and  for  OLS 
regression  (dashed  line)  for  the  Draper  and  smith  data.  The  data  points  are  marked 


overall  linear  increase,  their  slopes  vary  considerably  with  temperature.  By 
contrast,  the  two  graphs  with  s«16  are  strictly  monotone  increasing  and  their 
slopes  remain  close  to  the  overall  trend  at  all  temperatures. 

The  experimental  error  to  bias  tradeoff  parameter  X  is  seen  to  have  less 
effect  than  the  correlation  parameter  s  in  this  example.  The  two  curves  for 
s»16,  in  particular,  are  almost  identical  throughout  the  region  of  the 
observations.  Only  for  those  temperatures  where  our  predictions  are 
extrapolations  from  the  experimental  region  are  there  any  noticeable 
differences.  When  s-1,  changing  X  has  a  more  dramatic  effect.  The 
prediction  equation  for  s=1 ,  X«0.1  follows  the  observed  data  very  closely;  at 
each  of  the  experimental  temperatures  its  predictions  are  closer  to  the 
observed  yields  than  any  of  the  other  equations.  In  fact,  this  equation  seems 
to  follow  the  data  "too  well";  it  would  probably  be  very  difficult  to  convince 
the  experimenter  that  the  effect  of  increasing  temperature  on  yield  is  so 
irregular.  It  is  more  likely  that  we  will  be  convinced  that  our  parameter 
settings  are  unrealistic,  exagerating  the  effect  of  bias  in  shifting  the 
observed  yields  off  a  straight  line  when  experimental  error  may  have  been  much 
more  influential.  We  know  from  Theorem  3.3  that  in  the  limit,  as  X-K),  the 
prediction  equation  will  exactly  interpolate  the  data  (since  we  are  using  a 
non-singular  R  here),  and  it  is  clear  that  for  s*1,  decreasing  X  to  0.1  is 
sufficient  to  come  very  close  to  an  interpolant.  With  s*16,  even  this  ten  to 
one  bias  to  error  ratio  is  still  a  long  way  from  interpolating  the  data.  Thus 
we  also  see,  at  least  in  this  example,  an  "interaction”  between  the  two 
parameters . 

The  estimated  regression  parameters  and  their  standard  deviations 
(divided  by  o)  are  given  in  Table  1.  The  parameter  estimates  take  on  an 
added  importance  in  light  of  the  particular  bias  covariance  function  we  are 
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using.  Note  that  Rfx^Xj)  0  as  |x^  -  x2l  +  «.  It  follows  from  (3.5) 
that  for  this  example  the  prediction  equations  will  converge  asymptotically  to 
the  straight  lines  determined  by  the  parameter  estimates.  Thus,  even  if  the 
data  had  given  rise  to  prediction  equations  that  differ  considerably  from  an 
overall  linear  trend  in  the  experimental  region,  the  form  of  the  bias 
covariance  function  guarantees  that  extrapolation  to  points  far  enough  outside 
that  region  will  be  linear  in  temperature. 

The  estimated  regression  parameters  are  similar  for  all  the  models  with 
the  exception  of  s=16,  X=0.1.  This  bias  covariance  specification  estimates  a 
slope  which  is  more  than  10%  larger  than  that  for  any  of  the  other  models  and 
more  than  17%  larger  than  the  OLS  estimate.  One  possible  explanation  for  this 
involves  the  effect  of  the  most  distant  observations  on  extrapolation  when  the 
bias  covariance  is  assumed  to  be  rather  stable.  Intuitively,  if  the  yield 
corresponding  to  the  highest  temperature  is  "unusually"  high,  and  we  think 
that  the  bias  at  proximate  temperatures  is  quite  similar,  then  this  highest 
observation  may  play  a  disproportionate  role  in  determining  our 
extrapolations;  mathematically,  this  is  evident  from  examining  the  terms  in 
(3.5).  Such  an  effect  will  be  strengthened  the  more  we  decrease  X, 

Table  1:  The  estimated  regression  parameters  and  their  standard  deviations 


!  divided  by 

o)  for  the 

four  settings 

of  X  and 

s  and  for 

OLS  regression 

Constant 

(S.D./o) 

Slope 

(S.D./O) 

X-1, 

s«*1 

9.282 

(0.49) 

1.468 

(0.15) 

A»0.1, 

8“1 

9.2'  ’ 

(1.27) 

1.508 

(0.38) 

X-1, 

s*16 

9.288 

(0.75) 

1.516 

(0.19) 

A-0.1, 

s“16 

9.341 

(2.12) 

1.682 

(0.46) 

OLS 

9.273 

(0.30) 

1.436 

(0.10) 
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emphasizing  bias  to  a  greater  degree.  But  we  noted  above  that  extrapolations 
are  also  related  to  the  estimated  regression  coefficients.  In  this  example, 
the  two  most  extreme  temperatures  have  observed  yields  which  suggest  a  steeper 
slope  than  do  the  remainder  of  the  observations.  This  may  account  for  the 
fact  that  all  the  Bayesian  estimates  of  the  slope  are  greater  than  the  OLS 
estimate.  This  effect  becomes  dramatic  when  s=16,  reflecting  prior  belief  in 
a  "stable"  bias  contribution,  and  X=  0.1,  indicating  a  belief  that  bias 
greatly  dominates  experimental  error  in  determining  observed  deviations  from  a 
straight  line  graduating  function. 

The  estimated  standard  deviations  of  the  regression  coefficients,  unlike 
the  coefficients  themselves,  differ  considerably  from  one  model  to  the  next. 
Here  it  is  X  which  has  the  principal  effect.  Decreasing  X  from  1  to  0.1 
increases  the  standard  deviation  of  both  coefficients  (modulo  a)  by  over  2.5 
times.  Increasing  s  also  increases  the  standard  deviations,  but  to  a  more 
modest  extent.  The  effect  of  X  is  really  a  logical  consequence  of  the  prior 
assumptions  of  the  model:  if  a  straight  line  is  our  best  guess  as  to  the 
overall  dependence  of  yield  on  temperature,  but  we  believe  there  may  be  rather 
severe  bias,  then  it  only  seems  reasonable  that  the  experiment  will  be  unable 
to  provide  us  with  precise  estimates  of  the  coefficients  of  the  straight 
line.  However,  an  additional  point  should  also  be  kept  in  mind:  if  we 
believe  that  bias  dominates  experimental  error,  then  for  any  given  data  set  we 
will  presumably  obtain  a  lower  estimate  of  a  which  will  offset  some  of  the 
differences  evident  in  the  table. 

The  Bayesian  models  lead  to  quite  different  conclusions  than  OLS  about 

our  ability  to  predict  yield  as  a  function  of  temperature.  Figure  2  displays 

2 

graphs  of  Varfy^/Y^yJ/o  for  (7.1)  with  s*1  and  16  and  X**1,  and  for  OLS. 
These  functions  are  clearly  symmetric  about  0,  so  the  graphs  extend  only  over 
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non-negative  temperatures.  It  is  clear  from  the  graphs  that  OLS  is  more 
optimistic  about  our  ability  to  predict  yield  than  are  the  Bayesian  models. 
This  is  perfectly  reasonable,  since  the  Bayesian  models  provide  for  a  greater 
degree  of  prior  uncertainty  about  the  experiment.  The  two  Bayesian  models 
shown  in  Figure  2  also  differ  considerably.  The  prediction  variances  with 
s=16  are  much  lower  than  with  s=1.  Again  this  is  a  logical  consequence  of  the 
differing  prior  beliefs  being  modeled.  If  s=16,  the  bias  is  relatively 
stable,  and  the  experiment  permits  us  to  effectively  ascertain  its  effect  at 
least  throughout  the  experimental  region?  however,  if  s=1 ,  we  are  asserting  a 
prior  belief  that  the  bias  may  exhibit  considerable  local  fluctuation  so  that 
only  observations  in  the  immediate  vicinity  of  a  given  temperature  will  really 
be  of  much  use  in  predicting  yield  there.  This  is  the  exact  antithesis  of 
OLS,  where  certainty  about  the  linear  form  of  the  response  function  implies 
that  experimenting  at  the  most  extreme  temperatures  possible  will  lead  to  the 
most  precise  predictions  at  all  temperatures.  It  also  explains  the  local 
minima  that  we  observe  in  the  graph  for  s=1  near  most  of  the  actual 
experimental  temperatures,  while  the  other  graphs  are  strictly  monotone 
increasing. 

Within  the  range  of  the  experimental  data,  the  Bayesian  model  with  s-16 
produces  prediction  variances  about  7%  higher  than  OLS  while  the  model  with 
s”1  produces  variances  about  40%  higher.  However,  recall  that  these  are  the 
variances  of  predicting  a  new  observation  and  thus  they  all  include  a 
contribution  of  1  for  the  experimental  error  associated  with  that 
observsation.  If  we  subtract  that  term  to  obtain  variances  for  the  associated 
expected  value,  the  increases  in  the  variances  over  OLS  are  about  40%  and 
250%,  respectively.  Outside  the  range  of  the  data  the  differences  are  still 
greater.  The  prediction  variances  for  the  Bayesian  models  increase  rapidly 
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for  extrapolating  while  the  OLS  variances  continue  to  increase  quadrat ically. 

The  second  example  involves  an  analysis  of  simulated  data.  Suppose  the 

true,  but  unknown  response  function  is: 

(7.2)  g(x)  =  20[exp(-0.3x)  -  exp(-0.4x)]. 

This  response  function  is  graphed  in  Figure  3.  Response  functions  of  this 

form  often  arise  in  chemical  reactions  when  an  initial  product  A  decays  to  an 

intermediate  product  B  which  then  decays  to  a  final  product  C.  Suppose  an 

experiment  is  conducted  to  investigate  the  behaviour  of  the  response  in  the 

neighborhood  of  x=*5,  where  it  is  known  that  the  response  is  approximately 

linear  in  x.  The  experiment  calls  for  taking  observations  at  11  equally 

2 

spaced  sites  from  x=2.5  to  x=7.5.  We  will  assume  that  0  is  known  in  advance 
to  be  0.01. 

We  propose  to  model  the  observed  data  by  T  *  X0  +  n  +  e,  where  X  is  the 
design  matrix  for  straight  line  regression,  n  is  the  bias  term  and  e  the 
vector  of  experimental  errors.  We  will  model  the  bias  covariance  matrix  by: 
(7.3a)  R(x1#x2)  -  [(x1  -  5)(x2  -  5)/4]exp(-(x1  -  x2)2/2) 

if  (x1  -  5) (x2  -  5)  >  0 
(7.3b)  R(x.j,x2)  *  0  otherwise. 

The  idea  behind  this  covariance  function  is  to  represent  prior  belief  that  a 
straight  line  approximation  is  appropriate  in  the  vicinity  of  x>5,  but  may  be 
increasingly  suspect  as  x  becomes  more  distant  from  5.  This  sort  of 

assumption  is  appropriate  for  many  response  surface  experiments.  Thus  the 

2 

bias  variance  at  x  is  equal  to  (x-5)  /4.  The  division  by  4  is  a 
standardization  device,  so  that  the  choice  of  X  will  reflect  the  experimental 
error  to  bias  ratio  at  x”3  and  x«7,  covering  80%  of  the  experimental  region. 

We  will  suppose  that  at  these  points  bias  is  of  about  the  same  magnitude  as 
experimental  error  so  that  X»1.  The  second  term  in  (7.3a)  reflects  a  prior 
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belief  that  the  covariance  between  the  bias  contributions  at  various  x's  tends 
to  0  as  the  x's  become  increasingly  distant  from  one  another.  The  denominator 
under  the  square  in  the  exponential  term  controls  the  rate  at  which  this 
happens  and  has  been  chosen  to  reflect  a  reasonble  belief  about  the  smoothness 
of  the  response  function.  A  possible  drawback  to  (7.3)  is  that  for  any  given 
x.j,  R(  x,  x^ )  is  a  continuous  function  of  x,  but  its  derivatives  are 
discontinuous  at  x=5.  Thus  the  derivatives  of  the  prediction  equation  will 
also  be  discontinuous  at  x=5.  Also,  note  that  R(5,x)=0  for  all  x,  and  that 
x=5  is  one  of  the  observation  sites.  Thus  the  R  matrix  for  this  model  will  be 
singular,  so  that  some  of  the  results  of  Theorems  3.3  and  3.4  will  not  apply 
to  this  example. 

To  justify  that  (7.3)  is  a  legitimate  covariance  function,  consider  two 
independent  stochastic  processes,  each  beginning  at  x=5  but  moving  from  there 
in  opposite  directions  on  the  x  axis.  Suppose  that  each  process,  on  those  x's 
for  which  it  is  defined,  has  the  covariance  function  given  by  (7.1)  with 
s=2.  Define  two  new  stochastic  processes  by  multiplying  each  of  the  original 
processes  by  (x-5)/2.  Since  both  of  the  new  processes  are  constrained  to  be 
deterministically  0  at  x=5,  we  may  tie  them  together  into  a  single  stochastic 
process  defined  on  the  entire  real  line.  This  new  stochastic  process  is 
easily  seen  to  have  (7.3)  as  its  covariance  function. 

The  "experimental  data"  for  this  example  were  obtained  by  calculating  the 
actual  function  values  at  the  design  points  from  (7.2)  and  then  adding  to  them 
computer  generated  random  errors.  The  random  errors  were  generated  using  the 
normal  distribution  random  number  routine  in  MINITAB  (Ryan,  et.  al.  [1981]). 
The  function  values  and  the  observations  are  listed  in  Table  2.  The 
observations  are  also  included  on  the  graph  of  g(x)  in  Figure  3. 

The  Bayesian  prediction  equation  and  the  OLS  prediction  line  are  both 
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graphed  in  Figure  4.  They  are  very  similar  to  one  another,  although  between 

7.5  and  8.5  there  is  a  mild  discrepancy.  This  evidently  results  from  the 

effect  of  the  low  observation  at  7.5  on  the  Bayesian  line,  in  much  the  same 

manner  as  we  discussed  for  the  last  example.  The  estimated  regression 

coefficients  and  their  standard  errors  are  listed  in  Table  3.  The  standard 

2 

errors  were  calculated  using  the  assumed  known  value  a  =0.01.  The 

estimates  are  almost  identical,  but  as  in  the  previous  example,  the  standard 

errors  are  markedly  greater  with  the  Bayesian  model. 

2 

Since  o  is  assumed  to  be  known  here,  we  may  use  the  predictive  model 

check  described  in  section  5.  The  weighted  residual  sum  of  squares  is  0.1482, 
2 

and  dividing  by  a  we  obtain  the  checking  statistic  14.82.  The  upper  10% 

Table  2:  The  true  response  function  values  g(x)  and  the  simulated  data  points 
Y  for  the  second  example. 


g(x) 

Y 

2.5 

2.0897 

2.2079 

3.0 

2.1075 

2.2087 

3.5 

2.0668 

1.8612 

4.0 

1.9860 

1.9903 

4.5 

1.8788 

1.7798 

5.0 

1.7559 

1.9895 

5.5 

1 . 6249 

1.4901 

6.0 

1.4916 

1.6316 

6.5 

1.3600 

1.3449 

7.0 

1.2329 

1.3037 

7.5 

1.1122 

1.0837 
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point  of  the  Xg  distribution  is  14.68,  so  that  there  is  about  a  20%  chance, 
if  the  model  is  valid,  of  obtaining  so  extreme  a  value  for  the  checking 
statistic.  This  may  raise  some  doubts  about  the  proposed  model,  but  is 
certainly  not  convincing  proof  of  model  inadequacy.  Moreover,  closer 
inspection  reveals  that  the  simulated  random  errors  are  "a  bit  large"  for  the 
N(0,0.01)  distribution  that  was  used  —  the  sample  standard  deviation  of  the 
errors  is  0.13  rather  than  0.10  —  and  this  has  helped  to  inflate  the  checking 
statistic.  Since  other  residual  checks  do  not  point  to  any  gross  model 
inadequacies,  it  seems  reasonable  to  accept  the  validity  of  the  Bayesian 
analysis. 


Table  3s  The  estimated  regression  parameters  and  their  standard  deviations 
for  the  Bayesian  model  and  for  OLS  regression. 


Constant 

(S.O. ) 

Slope 

(S.D. ) 

Bayesian  model 

2.815 

(0.188) 

-0.213 

(0.037) 

OLS  regression 

2.790 

(0.100) 

-0.214 

(0.019) 
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8.  Conclusions 


Box  [1982]  observes  that  "all  models  are  wrong;  some  models  are  useful* 
This  aphorism  must  be  particularly  true  of  empirical  models  such  as 
polynomials  that  make  no  claim  to  do  more  than  locally  approximate  the  true 
function."  This  paper  has  presented  and  analyzed  several  suggestions  to 
modify  standard  response  surface  models  to  take  into  account  the  inherently 
approximate  nature  of  commonly  used  graduating  functions:  the  hierarchical 
model  of  Smith  [1973],  the  "polynomial  approximation  +  bias"  model  of  Blight 
and  Ott  [1975],  the  "localized  regression  model"  of  O'Hagan  [1978]  and  the 
smoothing  spline  approach  of  Wahba  [1978]. 

The  major  result  of  the  paper  has  been  to  demonstrate  that  these  four 
models  are  essentially  equivalent.  This  has  helped  make  possible  a  complete 
analysis  of  the  consequences  of  these  models  for  estimating  response  surfaces, 
synthesizing  and  expanding  upon  the  results  which  had  been  proved  for  each 
individual  model.  The  estimates  depend  on  the  magnitude  of  biaB  relative  to 
experimental  error,  with  ordinary  least  squares  regression  obtaining  in  the 
"special  case"  when  it  is  assumed  that  there  is  no  bias  at  all.  A  predictive 
check  of  the  model  has  been  developed  and  is  especially  useful  for  criticizing 
assumptions  about  the  ratio  of  bias  to  experimental  error.  Additionally,  the 
relationship  between  these  models  and  the  Kalman  filter  has  been  explored. 
Although  the  models  can  be  written  as  a  Kalman  filter,  this  does  not  seem  to 
be  a  useful  approach  for  most  response  surface  models. 

Much  effort  has  been  devoted  in  recent  years  to  developing  robust 
statistical  procedures,  that  is,  statistical  procedures  which  will  give 
reliable  answers  even  when  assumptions  about  the  experimental  mechanism  (i.e., 
the  proposed  statistical  model)  are  inexact.  It  is  useful  to  view  the  models 
discussed  in  this  paper  in  the  context  of  the  robustness  literature.  All  four 
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models  attempt  to  provide  more  reliable  answers,  but  not  by  altering  our 
methods  to  make  them  less  sensitive  to  faulty  models;  rather,  by  recognizing 
the  approximate  nature  of  empirical  graduating  functions,  they  strive  to 
provide  a  more  realistic  model  whose  assumptions  more  accurately  reflect  what 
we  actually  believe  about  the  experimental  data  observed.  This  is  consistent 
with  the  viewpoint  advocated  by  Box  [1980],  who  characterizes  robustif ication 
as  the  "judicious  and  grudging  elaboration  of  the  model  to  ensure  against 
particular  hazards".  The  models  provide  a  useful  and  flexible  method  for 
studying  the  adequacy  of  empirical  response  functions. 
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Appendix 

We  now  prove  the  various  distributional  results  stated  in  sections  3,  4 
and  5.  Host  of  the  proofs  rely  on  the  following  lemma,  which  gives  the  joint 
distribution  of  the  observed  responses,  the  regression  parameters  and  a  future 
(as  yet  unobserved)  response  at  x. 

Lemma  3.1:  Let  T  denote  the  response  vector,  a  future  response  at  x  and 
•2  the  regression  parameters.  Then,  under  the  model  described  by  (3.1),  the 
joint  probability  distribution  of  (Y',Y^, 62')  is  multivariate  normal  with 
mean  vector: 

(A.i)  E{(Y*,Yjt,e21)}  *  <<x80)\<x*#0)\80') 

and  variance-covariance  matrix: 


o2i  +  V  +  XV  X' 
n  1 

V  +  xv^ 

W1 

(A. 2) 

cov{(Y\Yx,82')}  = 

v'  + 

a2  +  v  +  r'v^x 

*'V1 

V 


V 


1 


Proof :  By  Theorem  2.1,  we  may  write  the  hierarchical  model  (3.1)  in  terms  of 
the  Blight-Ott  formulation  (2.3)-(2.4): 

y  -  y82  +  n  +  e, 

Yx  *  *’®2  +  nx  +  £x* 

where  (h\h  )'  ~  N(0,V*),  ( e' ,  e  )'  ~  N(0,  o2I  ),  and  8_  ~N(|L,V,),  and 

X  X  n*r  1  A  U  1 

where  these  three  random  vectors  are  independent.  Equations  (A.I)  and  (A. 2) 
now  result  from  simple  and  straightforward  computation. 

Theorem  3. 1 : 


(3.2)  E{Yx/Y-y}  *  *’80  +  (▼'  +  x’^X'Ho2^  +  ▼  +  XV1X')"1(y  -  X0O). 

(3.3)  E{82/Y«y}  ”  *0  +  (O2^  +  V  +  XV^'  )"\y  -  X8Q). 

Proof:  These  equations  follow  directly  from  Lemma  3.1  by  applying  standard 
formulas  for  conditional  expectations  of  partitioned  multivariate  normal 
vectors  (see,  for  example,  Anderson  [1958],  p.  28).  The  only  condition  which 
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must  be  checked  in  order  to  apply  these  formulas  is  that  the  covariance  matrix 


in  (A. 2)  is  non-singular.  Equivalently/  we  can  show  that  for  any  non-zero 
vector  a1  “  (a^,a2,a^),  the  random  variable  (Y' ,Y^,  does  not  have  a 
deterministic  distribution.  But 

(y,'Ve2)a  =  a1T  +  a2yx  +  a382 

"  +  a2Z‘  +  *3)82  +  aln  +  a2\  +  ai*  +  Vx* 

Now  e  and  are  independent  of  each  other  and  all  the  other  terms,  so  any 

vector  a  with  non-zero  entries  in  either  a^  or  a2  must  give  rise  to  a  random 
variable  with  positive  variance.  If  these  entries  are  all  zero,  but  a  is  non¬ 
zero,  then  a  must  have  at  least  one  non-zero  entry  in  again,  from  the 
above  equation,  it  is  clear  that  the  resulting  random  variable  must  have 
positive  variance. 

Lemma  3.2:  (o2I  +  V  +  XY^X’  )_1  -  M-1  -  +  V?1  J^X’m"1  . 

~ —  n  i  l 

where  M  “  o2I  +  V. 

n 

Proof :  The  desired  inverse  matrices  exist  because  of  the  assumption  that  V 
and  ▼1  are  covariance  matrices  and  hence  are  positive  definite.  The  lemma 
then  follows  as  a  special  case  of  the  more  general  "matrix  lemma"  proved  in 
Lindley  and  Smith  [1972]. 

Lemma  3.3:  V^'fM  +  XI^X'  )_1  -  (X'M-1X  +  v”1)"1!'*”1. 

Proof :  By  Lemma  3.2, 

VjX'  (M  +  XP^')"1  -  [m"1  "  H“1X(X'H-1X  +  V*1)"1X,ll"1] 

-  VjX'if1  -  v1x'k“1x(x'ii"1x  +  v'Wif1 

■  V*'1  -  *,[*•■■'«  *  -  *;’ ]«•«■’*  ♦  v;1)-1*'.-’ 

-  (xV’x  +  v;VW\ 
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Theorem  3.2; 


(3.6)  lim  e{y  /I«y}  =  s' (X,n"1X)"1X,ll"1y 

v’+O  * 

1  +▼' [u-1  -  M~1X(X,Il"1X)~1X,ll“1  ]y  and 

(3.7)  lim  e{« ,/T-y}  =  (X'Il“1X)"1X,M"1y* 

2vi  -° 

where  M  =  a  I  +  V. 

n 

Proof :  These  equations  follow  directly  from  (3.2)  and  (3.3)  upon  application 
of  Lemma  3.3  and  some  simple  algebra.  The  derivations  are  virtually 
identical,  so  we  will  prove  only  (3.7).  From  Theorem  3.1, 

E{«2/»=y}  -  *0  +  ▼1X'(o2In  +  V  +  XV1X')_1(y  -  XpQ) 
so  by  Lemma  3.3, 

E{e2/T*y}  -  +  (X'M-1X  +  v"1)“1X'M"1(y  -  XSQ) 

and  as  converges  to  0,  this  clearly  converges  to 

aQ  +  (X'n“1X)“1X'M'1(y  -  X*0) 

-1  -1  -1 
-  (X’*  X)  X'M  y. 

•  * 

Theorem  3.3;  Denote  V  »  TR  .  Then  we  have  the  following  limiting  forms  as 

X  *  *•  : 

(3.11)  lim  lim  e{y  /T*y}  *  *' (X'X)-1X'y. 

X-m»  V^+0 

(3.12)  lim  lim  E{0/X-y}  -  •(X'X)“1X'y. 

x-h»  v1  +o 

If  R  is  non** singular,  then  we  have  the  following  limits  as  X  ♦  0  : 

(3.13)  lim  lim  e{y  /T*y}  -  m'  (X'r”1!)"1!'*-^ 

X+0  v“'-K>  X 

+  r'{R_1  -  R_1X(X,R"1X)"1X'R"1>y 

(3.14)  lim  lim  e{0 /T*y}  -  (X'R"1X)"1X,R"1y. 

X+0  V1  -*>0 
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Proofs  From  Corollary  3.2.2, 

lim  e{y  /*■*}  -  «,(X»c“1xf1X,C_1y 

V  >0 

1  -1  -1  -1  -1  -1 

+  r'{c  -C  X(X'C  X)  X'C  }y 

where  C  «  XI  X 
n 

-  *' (X,D“1X)_1X,D_1y 

+  X_1r' {D_1  -  d"1X(X*D_1X)"1X'd"1 }y 

where  D  »  I  +  X  1R.  As  X  -►  «»,  D  +  I  ,  and  (3.11)  results.  An  Identical 
n  n 

derivation  leads  from  (3.10)  to  (3.12).  To  obtain  (3.13)  and  (3.14),  note 

that  if  R  is  non-singular,  then  C  1  “  ( XI  +  R)  1  ♦  R  1  when  X  ♦  0.  Thus 

n 

(3.13)  and  (3.14)  follow  directly  from  (3.9)  and  (3.10),  respectively. 

A 

Theorem  3.4:  Denote  by  Y  the  predicted  value  vector.  Then: 

(3.15)  Y  -  [a“2In  +  (T  +  XV1X,)_1]"1t0_2y  +  (▼  +  X^X'  j"1*^) 

(3.16)  E{e2A-y}  -  [x*  (o2In  +  V)-1X  +  (X*  (  o2In  +  V)-1y  +  v"1^]. 

Proof :  We  will  use  Smith's  model  (2.1),  along  with  the  results  of  Lindley  and 
Smith  [1972],  to  prove  this  theorem.  First,  from  (2.1a),  we  know  that 

e{y/8.|}  =  8^  Thus  we  can  calculate  Y  as: 

Y  =  E^/Y-y}. 

Equation  (3.15)  is  thus  a  special  case  of  the  theorem  proved  by  Lindley  and 
Smith  (equations  (12)  and  (13)).  Similarly,  we  can  use  the  lemma  proved  by 
Lindley  and  Smith  to  verify  (3.16).  Combining  (2.1a)  and  (2.1b),  we  obtain 
the  distribution  of  Y  conditional  on  9^  without  the  mediating  parameter 

8  • 

T 

Y/e,  -  N(X8  ,02I  +  V). 

Sm  n 

This  equation,  together  with  (2.1c),  matches  the  assumptions  of  the  lemma,  and 
equations  (7)  and  (8)  of  Lindley  and  Smith  give  (3.16). 

A 

Theorem  3.5:  Y  solves  the  minimisation  problem:  find  a  to  minimise 

(3.17)  (n-y)'(u-y)  +  (ti-X^)'  [o2(V  +  X^X’  f1)  («-X^). 
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If  V1  =  0,  (3.17)  becomes: 

(3.18)  (u-y)' (u-y)  +  Xu* [R-1  -  R~1*(X'u"1X)~1X'R~1 ]u. 

Moreover,  the  second  term  in  (3.18)  is  0  if  and  only  if  u  e  col(X). 

Proof :  As  in  the  proof  of  Theorem  3.4,  we  use  the  fact  that  Y  is  equal  to 
the  posterior  expectation  of  8^.  Combining  (2.1b)  and  (2.1c),  we  see  that 
the  prior  distribution  of  8^  is: 

8t  -  N(X8q,V  +  X^X'  ). 

The  conditional  distribution  of  Y  given  8^  is  given  by  (2.1a)  as: 

Y/8.  ~  N (  8. , o2 1  ). 
i  in 

Applying  Bayes*  Theorem,  we  find  that  the  posterior  probability  density 
function  of  81  is  proportional  to: 

exp{-ta-2(y-81)*(y-ei)  +  <8t  -  X8Q )  *  (▼  +  XV^- )"’ (  81  -  XBQ)]/2} 

=  expJ-Q^)^}. 

This  is  clearly  the  p.d.f.  for  a  normal  distribution,  and  its  expectation  can 
be  found  by  minimizing  the  quadratic  form  0(8^.  But  minimizing  0(8^  is 
equivalent  to  minimizing  (3.17).  Thus,  the  posterior  expectation  of  8 and 

A 

hence  Y,  minimizes  (3.17).  To  derive  (3.18),  we  require  the  simple  matrix 
identity  (▼  +  X^X')"1  *  v'1  -  V^XfX'v"1!  +  v”1  ^X'V"1 . 

Substituting  this  into  (3.17),  we  obtain: 

(wy)'(u-y)  +  (u-XB0)*  [a2(V  +  XV^' j"1)  (u-XB0) 

-  (u-y) '  (u-y)  +  (u"X80)’o2[v"1  -  v”1x(x*y”1x  +  v”1  T’x'v-1  ]  (u-x80) 

2  -1 

and  using  the  standardized  form  V  =  o  K/X  and  the  assumption  that  =  0, 

-  (u-y)*  (u-y)  +  X(u-X80)' [R*1  -  r"1X(X'r_1X)'1x*r"1]  (w-X80). 

But  X*(R-1  -  r"1X(X*r“1X)“1X'R-1]  =  [R_1  -  R_1X(X'r"1X)"1X*r"1 )X  =  0, so  that 
we  obtain: 


(u-y) * (u 


■y)  +  Xu’ [R-1  -  r"1X(X'r“1X)”1X'R_1)u. 


This  is  precisely  <3.18).  If  u  t  cold),  then  u  =  X(  for  some  vector  0* 
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Then  the  equations  four  lines  above  imply  that  the  second  term  in  (3.18)  is 
0.  On  the  other  hand,  suppose  that  u  is  a  vector  such  that  the  second  term 
above  is  0.  Note  that  this  term  has  the  form  u'An,  where  A  is  a  positive 
semi-definite  matrix.  The  quadratic  form  can  be  0  only  if  Au  *  0.  This  then 
implies  that:  0  *  RAu  =  [i  -  X(X*R_ 1X)-1X,r'-1 ]u.  The  matrix 

I  -  X(X'R  *X)  ^X'R  *  is  easily  seen  to  be  an  idempotent  matrix  which 
projects  into  the  orthogonal  complement  of  col(X).  Thus,  the  above  condition 
implies  that  u  e  col(X). 

Theorem  4. 1 t 

(4.1)  Var{Yj£/»-y}  -  o2  +  v  +  *'[X'(o2ln  +  V)_1X  +  v"1]-1* 

2  -1 

-  2v'(0  I  +  ▼  +  XR.X*  )  XB.* 

n  11 

-  V«X2In  ♦  V  +  X^X'fV 

(4.2)  Var{e2/r-y}  =  [X'{a2ln  +  V)-Ix  +  v"Y\ 

Proof;  These  equations,  like  those  of  Theorem  3.1,  result  from  Lemma  3.1  by 
applying  standard  formulas  for  partitioned  multivariate  normal  vectors. 

Lemma  5.1:  Given  the  prediction  equation  of  Corollary  3.2.2,  the  residual 
vector  is  given  by; 

(5.5)  e(X)  -  B(X)y, 

where  B(X)  =  X[c-1  -  c“1X(X*c’1X)"1X,C~1  ]  and  C  -  XI  +  R. 

n 

Proof :  From  (3.9),  the  predicted  value  vector,  as  a  function  of  X,  is: 

T(X)  -  A(X)y, 

where  A(X)  =  X(X,c“1X)’*1X'c"1  +  r{c_1  -  c”1X(X,c"1X)”1X,C_1  }. 

It  is  easy  to  verify  that  XtX'c^xrYc-1  -  I  -  X“1CB(X).  Substituting 
this  identity  into  the  equation  above  we  obtain: 

A( X)  -  Ir  -  X-1CB(X)  +  X-1RB(X) 

*  In  +  X-1(R  -  C)B(X). 
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But  C  ■  Xl  +  R  so  that* 
n 

A(X)  «  I  -  B(X). 
n 

The  residual  vector  is  thus  given  by: 

e(X)  =  y  -  Y(X)  *  y  -  [lR  -  B(  X)  ]y  *■  B(  X)y. 

Theorem  5.1s  Define  h(X)  -  (t-XJJ'(o2I  +  X-1  a2R  +  XV.X'  >”1  (Y-Xfc. ) ,  and  let 
■ —  “  on  i  u 

h  (X)  *  lim  h(X)e  Then: 

(5.6)  1  h  (X)  *  [B(X)Y]'(Xn  +  X-1R) [b( X)t]/02,  and 

(5.7)  lim  h*(X)  *  RSS/O2, 

X+“> 

where  RSS  is  the  residual  sum  of  squares  from  ordinary  least  squares 


regression. 

2  ~12 

Proofs  We  will  apply  Lemma  3.2,  noting  that  M  m  a  1^  +  V  «  X  o  C.  We  then 
obtain: 

h(  X)  »  <Y  -  Xfl0)'(a2In  +  X"Vr  +  )‘1(T  -  X^) 

-  (Y  -  xB0)' {xo”2c_1  -  Xo‘2c”1x(x'Xo“2c'*1y  +  v“1)“1x,Xe‘’2c“1  }* 

(Y  -  XPQ) 

and  as  1  converges  to  0,  this  clearly  converges  to: 

Xo"2(y  -  xa0)'{c-1  -  c”1x(x,c”1x)"1x'c”1}(Y  “  X3fl) 

-  Y'B(X)Y/02. 

Finally,  it  is  easy  to  show  that  X  +  X  ^R  is  a  generalised  inverse  of 

B( X) j  that  is,  B(X)(I  +  X  ^R)B(X)  *  B(X).  Substituting  this  into  the  last 

n 

line  and  noting  that  B(X)  is  symmetric  gives  us  (5.6).  In  order  to  prove 
(5.7),  we  may  calculate  the  limit  of  (5.6)  directly.  Alternatively,  recall 
from  Theorem  3.3  that  as  X  +  •,  the  Bayesian  estimates  converge  to  the 


ordinary  least  squares  estimates.  Correspondingly,  the  residual  vector 
B(X)y  converges  to  the  OLS  residual  vector.  The  weighting  matrix  clearly 
converges  to  the  identity.  Hence,  (5.6)  converges  to  the  residual  sum  of 


squares  for  OLS  regression. 
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Lemma  5 . 2 i  Given  model  (2.1),  let  T  denote  the  vector  of  predicted  values. 


li^»  e{(Y  -  T)'(In  +  X_1R)(Y  -  Y)}  -  (n-p) i 


Proof :  He  use  Theorem  3.1  to  find  Y,  using  the  standardized  notation 

▼  -  X-1o2R: 

y  -  *P0  +  U_,o2r  +  H^'Ko2^  +  x'Vr  +  x^x'T’ty  -  x^). 

2  -12 

Adding  and  subtracting  o  1^  from  X  0  R  ♦  XV^X1,  we  finds 

Y  -  *B0  +  y  -  «0  -  02.(<J2In  +  *“Vb  +  !Y1X')"1(y  -  XBQ) 

-  y  -  a2(o2I  +  X~Vr  +  XV.X' )-1(y  -  XjL),  so  that 

H  10 

Y  -  Y  -  C2(a2l  +  X-1C2R  +  XV,X')-1(Y  -  x(L),  and 

n  10 

e{(Y  -  Y)(Y  -  Y)'}  -  04(02ln  +  X_102R  +  XV^X'  )_1E  { ( Y  -  XBQ )  (Y  -  > '  }* 

<02I  +  X~Vr  +  XY.X')-1 

-  o\o\  +  X-Vr  +  x^x’)-1, 

since  from  Lemma  3.1,  we  know  that: 

E  {  (Y  -  XB0)(Y  -  XB0)'}  -  <02In  +  x”1 02R  +  XV^X*  ) . 

He  then  have  that  for  any  nxn  matrix  H: 

A  A  a  A 

e{(Y  -  Y)'1I(Y  -  Y)}  -  trace  [HE  {(Y  -  Y)  (Y  -  Y)’{] 

-  o4 trace  [h(  o2I  +  X-1  o2R  +  XV,X' )_1  ]. 

Applying  Lemma  3.2  to  the  expression  above,  and  using  the  fact  that  the  limit 
and  the  trace  may  be  interchanged,  it  is  easy  to  verify  thatt 


A  A  M 

lilf  e{(Y  -  Y)'W(Y  -  Y)}  -  o  trace [hb(  X)  ]. 


To  complete  the  proof  we  need  only  show  that  trace [(X  +  X  R)B(X)1  -  n-p. 

n 

But  this  follows  immediately  from  the  fact  that  (I  +  X  2R)B( X)  is  an 

n 

idempotent  matrix  which  projects  into  the  orthogonal  complement  of  col(Z), 


i 


Theorem  5.2:  Given  the  sampling  model  T  -  N(X0,  +  x"1^),  and  assuming 

—  n 

2 

that  o  is  known,  the  appropriate  lack  of  fit  statistic  for  the  model 
e{y}  -  X0  iss 

(5.6)  h*(X)  -  [b(X)y]'(I  +  X_1R)  [b(X)y]/o2. 

2 

The  sampling  theory  distribution  of  (5.6)  under  this  model  is  X^-p* 

Proof:  Under  the  above  sampling  theory  model,  we  use  generalized  least 
squares  to  find  the  predicted  value  vector: 

Y  -  x[x*(I  +  X-^R)""1xJ~^X‘  (I  +  X_1R)-1Y. 

*■11  n 

It  is  now  easy  to  show  that  the  corresponding  residual  vector  can  be  written: 

Y  -  Y  “  (I  +  X_1R)B(X)Y. 

n 

The  appropriate  lack  of  fit  statistic  for  this  model  is  the  weighted  residual 
sum  of  squares,  with  the  weight  matrix  inversely  proportional  to  the  variance 
matrix: 

*  2  -12-1  * 

(Y  -  Y)'(0  I  +  X  O  R)  (Y  -  Y) 
n 

-  [(I  +  X-1R)B(X)y]'(I  +  X-1R)”1[(I  +  X-1R)B(  X)y]/o2 

n  ■*  n  k  n 

-  [b(  X)y]  *  dn  +  x_1r)  [b(X)y]/o2. 

* 

Corollary  5.2.1:  h  (X)  is  monotone  increasing  in  X. 

Proof:  Let  u  be  any  vector  and  consider  the  function: 

f(X)  -  u'  (i  +  x’W 
n 

Since  R  is  a  positive  definite  matrix,  it  is  clear  that  for  every  u,  f(X)  is 

a  monotone  increasing  function  of  X.  Let  us  denote  by  e( X)  the  vector  of 

residuals  from  using  generalized  least  squares  on  the  sampling  theory  problem 

of  Theorem  5.2.  We  found  in  proving  that  theorem  that 

h(X)  “  e(X)(I  +  X  *R)  ^(XJ/o2.  Now  let  X_  >  X, »  we  wish  to  show  that 
n  2  1 

h(X2>  >  h(X^).  We  have: 

h(X2)  -  e(X2)'dn  +  X21R)_1m(X2)/o2 
>  •<X2)*dn  +  x"1)_1e(X2)/o2 


71 


by  the  general  monotonicity  property  described  above  for  f(X) 

>  e(X  ) '  (I  +  x"1R)‘1e(Xi)/o2  -  MX,) 

i  n  1  i  i 

because  e( X^ )  is  the  residual  vector  for  the  generalized  least  squares 
estimator  when  1  and  hence  gives  a  smaller  weighted  residual  sum  of 

squares  than  does  eCX^). 
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